NVIDIA, a pioneer in accelerated computing and AI technology, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. This role combines cutting-edge AI infrastructure with cloud computing expertise, offering an opportunity to work with high-performance computing at scale.
The position involves maintaining and optimizing NVIDIA's DGX Cloud platform, a fully managed AI infrastructure service deployed across major cloud providers. As an SRE, you'll be responsible for ensuring the reliability and performance of large-scale Kubernetes clusters that power AI workloads for researchers and enterprise clients globally.
The role requires deep expertise in Kubernetes, distributed systems, and cloud infrastructure, with a focus on GPU-accelerated computing environments. You'll work with a diverse tech stack spanning multiple cloud platforms (AWS, GCP, Azure, OCI) and modern observability tools. The position demands both technical depth in systems engineering and the ability to drive reliability improvements through automation and architectural changes.
NVIDIA offers a competitive compensation package with a base salary range of $208,000 to $333,500, plus equity and comprehensive benefits. The company is known for its innovative culture and impact on transformative technologies like AI and digital twins. This role provides an opportunity to work on infrastructure that powers some of the most advanced AI computing systems in the world.
The ideal candidate will bring 12+ years of production operations experience, strong programming skills, and expert-level knowledge of Kubernetes and cloud platforms. Experience with GPU workload orchestration and AI infrastructure is a significant plus. You'll be part of a team that's pushing the boundaries of what's possible in AI computing infrastructure while maintaining high standards of reliability and performance.