Senior Site Reliability Engineer - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$168,000 - $333,500
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud platform. This role sits at the intersection of software engineering and systems operations, focusing on designing and maintaining large-scale production systems with high efficiency and availability.

The position requires expertise in Kubernetes, distributed systems, and automation, with a strong emphasis on eliminating manual work through sophisticated tooling and optimization. As an SRE at NVIDIA, you'll be responsible for ensuring maximum reliability and uptime of GPU cloud services while enabling developers to make system changes safely and efficiently.

The role offers a unique opportunity to work with cutting-edge technology in AI and cloud computing, while being part of a diverse and intellectually curious team that values problem-solving and openness. You'll be involved in the entire service lifecycle, from design consulting to production support, and will have the chance to work on meaningful projects with significant impact.

NVIDIA offers competitive compensation, including a base salary range of $168,000 - $333,500 (depending on level), equity, and comprehensive benefits. The company promotes a blame-free environment that encourages self-direction and provides support and mentorship for professional growth.

The ideal candidate will bring 10+ years of experience, strong coding skills in languages like Python or Go, and deep knowledge of Linux and containers. You'll be joining a company at the forefront of AI and digital twins technology, transforming major industries and making a profound impact on society.

Working at NVIDIA means being part of a team that tackles challenges no one else can solve, with the opportunity to contribute to groundbreaking technological advancements. The role offers flexibility with remote work options and the chance to collaborate with some of the most forward-thinking professionals in the technology industry.

Last updated a day ago

Responsibilities For Senior Site Reliability Engineer - DGX Cloud

  • Design, implement, and support operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting and developing tools
  • Maintain services by monitoring availability, latency and system health
  • Scale systems through automation
  • Practice sustainable incident response
  • Participate in on-call rotation

Requirements For Senior Site Reliability Engineer - DGX Cloud

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or related technical field
  • 10+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl, or Ruby
  • In-depth knowledge of Linux, networking, and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

  • Equity
  • Medical Insurance
