NVIDIA, the pioneer in accelerated computing and AI, is seeking a Senior Site Reliability Engineer to join its Hardware Infrastructure Farm team. This role is central to designing and implementing the compute clusters that power NVIDIA's silicon development. As an SRE, you'll be responsible for building and operating efficient, high-reliability clusters while driving automation and improvements that enhance engineer productivity.
The position combines deep technical expertise in HPC environments with strategic thinking about how systems interact. You'll work with technologies like LSF/SLURM, Linux, Docker, and Python to manage complex computing resources. The role emphasizes automation, proactive problem-solving, and the upkeep of high-reliability systems that directly impact NVIDIA's chip development process.
NVIDIA's culture values diversity, intellectual curiosity, and openness. The company has transformed itself over two decades, from inventing the GPU that revolutionized gaming and graphics to leading AI and machine learning. This role offers the opportunity to work with cutting-edge technology while contributing to NVIDIA's continued innovation in AI and accelerated computing.
The ideal candidate brings 5+ years of experience in large Linux environments, strong expertise in job scheduler administration, and excellent problem-solving abilities. You'll collaborate with diverse teams, automate processes, and directly influence the quality and time-to-market of NVIDIA's next-generation chips. This position offers the chance to work at the intersection of infrastructure management and chip development at a company that consistently pushes technological boundaries.