Senior Site Reliability Engineer - AI Research Clusters

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
Santa Clara, CA, USA
Westford, MA 01886, USA
Austin, TX, USA
$184,000 - $425,500
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
6+ years of experience
AI

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA, the pioneer in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role is central to designing and implementing the GPU compute clusters that power NVIDIA's AI research initiatives. As an SRE, you'll maintain and optimize large-scale AI infrastructure, working with some of the most advanced computing systems in the world.

The position offers an opportunity to work with NVIDIA's state-of-the-art GPU technology and contribute to the infrastructure that enables breakthrough AI research. You'll be responsible for ensuring the reliability, efficiency, and performance of massive GPU clusters while implementing automation solutions to enhance researcher productivity. The role combines hands-on technical work with strategic thinking about system architecture and optimization.

The ideal candidate will bring deep expertise in GPU computing, AI infrastructure, and large-scale system operations. You'll work in a culture that values diversity, intellectual curiosity, and problem-solving, with opportunities to collaborate with brilliant minds in the field. The position offers competitive compensation, including a substantial base salary range of $184,000 to $425,500, plus equity and comprehensive benefits.

This is an excellent opportunity for experienced engineers who are passionate about high-performance computing and want to make a significant impact in the AI field. You'll be working with cutting-edge technology, solving complex technical challenges, and contributing to NVIDIA's mission of advancing AI and accelerated computing. The role offers both technical depth and the chance to influence the direction of critical research infrastructure.

Last updated 6 days ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot and diagnose system failures
  • Scale systems through automation
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python
Kubernetes
Linux
  • Bachelor's degree in Computer Science, Electrical Engineering or related field
  • 6+ years of experience designing and operating large-scale compute infrastructure
  • Operational experience with clusters of at least 5,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers (Slurm)
  • Experience with cluster configuration management tools (BCM or Ansible)
  • Knowledge of container technologies such as Docker and Enroot
  • Experience programming in Python and Bash scripting

Benefits For Senior Site Reliability Engineer - AI Research Clusters

Medical Insurance
Equity
  • Competitive base salary
  • Equity compensation
  • Comprehensive benefits package
