Senior Site Reliability Engineer - DGX Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.

Santa Clara, CA, USA

$168,000 - $333,500

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

10+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud platform. This role sits at the intersection of software engineering and systems operations, focusing on designing and maintaining large-scale production systems with high efficiency and availability.

The position requires expertise in Kubernetes, distributed systems, and automation, with a strong emphasis on eliminating manual work through sophisticated tooling and optimization. As an SRE at NVIDIA, you'll be responsible for ensuring maximum reliability and uptime of GPU cloud services while enabling developers to make system changes safely and efficiently.

The role offers a unique opportunity to work with cutting-edge technology in AI and cloud computing, while being part of a diverse and intellectually curious team that values problem-solving and openness. You'll be involved in the entire service lifecycle, from design consulting to production support, and will have the chance to work on meaningful projects with significant impact.

NVIDIA offers competitive compensation, including a base salary range of $168,000 - $333,500 (depending on level), equity, and comprehensive benefits. The company promotes a blame-free environment that encourages self-direction and provides support and mentorship for professional growth.

The ideal candidate will bring 10+ years of experience, strong coding skills in languages like Python or Go, and deep knowledge of Linux and containers. You'll be joining a company at the forefront of AI and digital twins technology, transforming major industries and making a profound impact on society.

Working at NVIDIA means being part of a team that tackles challenges no one else can solve, with the opportunity to contribute to groundbreaking technological advancements. The role offers flexibility with remote work options and the chance to collaborate with some of the most forward-thinking professionals in the technology industry.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer - DGX Cloud

Design, implement and support the operational and reliability aspects of large-scale Kubernetes clusters
Engage in service lifecycle from inception through deployment and refinement
Support services through system design consulting and developing tools
Maintain services by monitoring availability, latency and system health
Scale systems through automation
Practice sustainable incident response
Participate in on-call rotation

Requirements For Senior Site Reliability Engineer - DGX Cloud

Python

Linux

Kubernetes

BS degree in Computer Science or related technical field
10+ years of experience
Experience with infrastructure automation and distributed systems design
Experience with Python, Go, Perl or Ruby
In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

Equity
Medical Insurance
