Senior Site Reliability Engineer, DGX Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA

$208,000 - $333,500

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

12+ years of experience

AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing and AI technology, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. This role combines cutting-edge AI infrastructure with cloud computing expertise, offering an opportunity to work with high-performance computing at scale.

The position involves maintaining and optimizing NVIDIA's DGX Cloud platform, a fully managed AI infrastructure service deployed across major cloud providers. As an SRE, you'll be responsible for ensuring the reliability and performance of large-scale Kubernetes clusters that power AI workloads for researchers and enterprise clients globally.

The role requires deep expertise in Kubernetes, distributed systems, and cloud infrastructure, with a focus on GPU-accelerated computing environments. You'll be working with a diverse tech stack including Kubernetes, multiple cloud platforms (AWS, GCP, Azure, OCI), and modern observability tools. The position demands both technical depth in systems engineering and the ability to improve reliability through automation and architectural changes.

NVIDIA offers a competitive compensation package with a base salary range of $208,000 to $333,500, plus equity and comprehensive benefits. The company is known for its innovative culture and impact on transformative technologies like AI and digital twins. This role provides an opportunity to work on infrastructure that powers some of the most advanced AI computing systems in the world.

The ideal candidate will bring 12+ years of production operations experience, strong programming skills, and expert-level knowledge of Kubernetes and cloud platforms. Experience with GPU workload orchestration and AI infrastructure is a significant plus. You'll be part of a team that's pushing the boundaries of what's possible in AI computing infrastructure while maintaining high standards of reliability and performance.

Last updated 3 days ago

Responsibilities For Senior Site Reliability Engineer, DGX Cloud

Support large-scale Kubernetes services through system design consulting and tool development
Build and implement operational aspects of large-scale Kubernetes clusters
Define SLOs/SLIs and monitor error budgets
Maintain service availability and system health
Operate and optimize GPU workloads across multiple cloud platforms
Lead triage and root-cause analysis of high-severity incidents
Participate in on-call rotation
Scale systems through automation

Requirements For Senior Site Reliability Engineer, DGX Cloud

Kubernetes

Python

Linux

BS in Computer Science or related technical field, or equivalent experience
12+ years of experience operating production services at scale
Expert-level knowledge of Kubernetes administration and microservices architecture
Experience with infrastructure automation tools
Proficiency in Python or Go
In-depth knowledge of Linux operating systems and networking fundamentals
Demonstrated ability to troubleshoot complex systems issues
Strong working knowledge of SRE principles
Experience with observability stacks

Benefits For Senior Site Reliability Engineer, DGX Cloud

Equity
Competitive Benefits Package
