NVIDIA, the global leader in AI and accelerated computing, is seeking a Senior Site Reliability Engineer to join its Hardware Infrastructure Farm team. This role combines the challenges of managing large-scale HPC environments with the excitement of contributing to cutting-edge chip development.
The position sits at the intersection of infrastructure management and silicon development, where your work directly impacts the creation of next-generation chips. As an SRE, you'll design and implement the groundbreaking compute clusters that power all silicon development across NVIDIA. The role demands expertise in operating high-reliability systems, optimizing efficiency, and building automation that improves engineer productivity.
NVIDIA's culture emphasizes diversity, intellectual curiosity, and problem-solving in a blame-free environment. The organization brings together professionals from various backgrounds and perspectives, encouraging collaboration and innovative thinking. The role offers significant autonomy while providing the necessary support and mentorship for continuous learning and growth.
Key responsibilities include managing workload schedulers in HPC environments, developing automation solutions, troubleshooting complex issues, and collaborating with domain experts. The ideal candidate will bring extensive experience with job scheduler administration, Linux systems, container technologies, and scripting languages. The position requires both technical expertise and strong communication skills to work effectively across diverse teams.
The compensation package includes a base salary range of $184,000 to $287,500 USD, plus equity and benefits. The role offers hybrid work arrangements and can be based in one of several locations, including Santa Clara, Westford, or Austin. Join NVIDIA in its mission to amplify human imagination and intelligence through technological innovation.