Senior Site Reliability Engineer, DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
$208,000 - $333,500
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
12+ years of experience
AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing and AI technology, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. This role combines cutting-edge AI infrastructure with cloud computing expertise, offering an opportunity to work with high-performance computing at scale.

The position involves maintaining and optimizing NVIDIA's DGX Cloud platform, a fully managed AI infrastructure service deployed across major cloud providers. As an SRE, you'll be responsible for ensuring the reliability and performance of large-scale Kubernetes clusters that power AI workloads for researchers and enterprise clients globally.

The role requires deep expertise in Kubernetes, distributed systems, and cloud infrastructure, with a focus on GPU-accelerated computing environments. You'll be working with a diverse tech stack including Kubernetes, multiple cloud platforms (AWS, GCP, Azure, OCI), and modern observability tools. The position demands both technical depth in systems engineering and the ability to improve reliability through automation and architectural changes.

NVIDIA offers a competitive compensation package with a base salary range of $208,000 to $333,500, plus equity and comprehensive benefits. The company is known for its innovative culture and impact on transformative technologies like AI and digital twins. This role provides an opportunity to work on infrastructure that powers some of the most advanced AI computing systems in the world.

The ideal candidate will bring 12+ years of production operations experience, strong programming skills, and expert-level knowledge of Kubernetes and cloud platforms. Experience with GPU workload orchestration and AI infrastructure is a significant plus. You'll be part of a team that's pushing the boundaries of what's possible in AI computing infrastructure while maintaining high standards of reliability and performance.

Last updated 3 days ago

Responsibilities For Senior Site Reliability Engineer, DGX Cloud

  • Support large-scale Kubernetes services through system design consulting and tool development
  • Build and implement operational aspects of large-scale Kubernetes clusters
  • Define SLOs/SLIs and monitor error budgets
  • Maintain service availability and system health
  • Operate and optimize GPU workloads across multiple cloud platforms
  • Lead triage and root-cause analysis of high-severity incidents
  • Participate in on-call rotation
  • Scale systems through automation

Requirements For Senior Site Reliability Engineer, DGX Cloud

Kubernetes · Python · Go · Linux
  • BS in Computer Science or related technical field, or equivalent experience
  • 12+ years of experience operating production services at scale
  • Expert-level knowledge of Kubernetes administration and microservices architecture
  • Experience with infrastructure automation tools
  • Proficiency in Python or Go
  • In-depth knowledge of Linux operating systems and networking fundamentals
  • Demonstrated ability to troubleshoot complex systems issues
  • Strong working knowledge of SRE principles
  • Experience with observability stacks

Benefits For Senior Site Reliability Engineer, DGX Cloud

  • Equity
  • Competitive Benefits Package
