NVIDIA, the pioneer in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. This role is central to designing and implementing cutting-edge GPU compute clusters that power NVIDIA's AI research initiatives. As an SRE, you'll be at the forefront of maintaining and optimizing large-scale AI infrastructure, working with some of the most advanced computing systems in the world.
The position offers an opportunity to work with NVIDIA's state-of-the-art GPU technology and contribute to the infrastructure that enables breakthrough AI research. You'll be responsible for ensuring the reliability, efficiency, and performance of massive GPU clusters while implementing automation solutions to enhance researcher productivity. The role combines hands-on technical work with strategic thinking about system architecture and optimization.
The ideal candidate will bring deep expertise in GPU computing, AI infrastructure, and large-scale system operations. You'll work in a culture that values diversity, intellectual curiosity, and problem-solving, with opportunities to collaborate with brilliant minds in the field. The position offers competitive compensation: a base salary range of $184,000 to $425,500, plus equity and comprehensive benefits.
This is an excellent opportunity for experienced engineers who are passionate about high-performance computing and want to make a significant impact in the AI field. You'll be working with cutting-edge technology, solving complex technical challenges, and contributing to NVIDIA's mission of advancing AI and accelerated computing. The role offers both technical depth and the chance to influence the direction of critical research infrastructure.