NVIDIA, a global leader in accelerated computing and GPU technology, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. The role centers on the GPU compute clusters that power AI research across NVIDIA. As an SRE, you'll design, implement, and maintain high-performance computing environments, with a focus on reliability, efficiency, and performance optimization.
You'll work with cutting-edge AI and GPU computing technology as part of a diverse, collaborative team that values intellectual curiosity and problem-solving. The work includes building and improving the ecosystem around GPU-accelerated computing, developing large-scale automation solutions, and helping researchers optimize their deep learning workflows.
Key responsibilities include designing state-of-the-art GPU compute clusters, implementing automation for enhanced productivity, and ensuring system reliability through proactive monitoring and incident response. You'll work with advanced technologies including Kubernetes, container platforms, and high-performance computing schedulers.
The ideal candidate brings 5+ years of experience in large-scale infrastructure operations, strong Python programming expertise, and a deep understanding of GPU computing and AI infrastructure. The role is a chance to make a lasting impact on NVIDIA's AI research capabilities while working in a supportive environment that promotes learning and growth.
Join NVIDIA's team of innovators who are pushing the boundaries of technology and transforming industries through GPU computing and artificial intelligence. This position offers the chance to work on some of the largest and most complex systems in the world while contributing to groundbreaking advancements in AI research infrastructure.