NVIDIA is seeking a Senior Site Reliability Engineer to join their GPU AI/HPC Infrastructure team. This role focuses on designing and implementing cutting-edge GPU compute clusters that power AI research across NVIDIA. The position requires expertise in building and operating high-reliability, efficient, and high-performance clusters while driving automation improvements to enhance researcher productivity.
As an SRE at NVIDIA, you'll be responsible for the holistic view of system interactions, utilizing various tools and approaches to address complex challenges. The role emphasizes limiting reactive operational work, conducting blameless postmortems, and proactively identifying potential issues. NVIDIA's SRE culture values diversity, intellectual curiosity, problem-solving, and openness.
The role involves working with large-scale GPU-accelerated computing environments, developing automation solutions, and maintaining AI-HPC GPU clusters. You'll support researchers in optimizing their deep learning workflows and design systems focusing on performance at scale, real-time monitoring, logging, and alerting.
NVIDIA has transformed from its origins in PC gaming to becoming a leader in artificial intelligence and parallel computing. The company's GPUs are now central to worldwide AI research, requiring massive parallel computation capabilities. This position offers the opportunity to work with cutting-edge technology while making a lasting impact on the world of AI research and development.
The ideal candidate will bring extensive experience in site reliability engineering for high-performance computing, deep understanding of GPU computing, and proven expertise in cluster management and automation. Additional valuable skills include experience with NVIDIA GPUs, CUDA Programming, cloud deployment, and distributed storage systems.
This is a hybrid position available across multiple locations in India, offering the opportunity to work with a diverse, supportive team while contributing to groundbreaking developments in AI infrastructure.