
Senior Site Reliability Engineer - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$144,000 - $333,500
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - DGX Cloud

NVIDIA, the world leader in accelerated computing, is seeking a Senior Site Reliability Engineer for its DGX Cloud team. This role combines software and systems engineering practices to design, build, and maintain large-scale production systems. As an SRE at NVIDIA, you'll work with cutting-edge technologies such as Kubernetes and OpenStack to keep GPU cloud services as reliable as possible. The position requires expertise in systems, networking, coding, database management, and continuous deployment.

You'll be part of a diverse, intellectually curious team that values problem-solving and openness. The role offers meaningful projects, with support and mentorship for growth, in a blame-free environment that encourages innovation and risk-taking. NVIDIA's work in AI and digital twins is transforming major industries, making this an opportunity to impact society through technology.

The position offers competitive compensation, including a base salary range of $144,000 - $333,500, plus equity and comprehensive benefits. NVIDIA values diversity and maintains an inclusive work environment, making it one of technology's most desirable employers.


Responsibilities For Senior Site Reliability Engineer - DGX Cloud

  • Design, implement and support the operational and reliability aspects of large-scale Kubernetes clusters
  • Engage in service lifecycle from inception through deployment and refinement
  • Support services through system design consulting, developing tools, capacity management and launch reviews
  • Maintain services by measuring and monitoring availability, latency and system health
  • Scale systems through automation and improve their reliability
  • Practice sustainable incident response and blameless postmortems
  • Participate in on-call rotation for production systems

Requirements For Senior Site Reliability Engineer - DGX Cloud

Python · Go · Linux · Kubernetes
  • BS degree in Computer Science or related technical field
  • 5+ years of experience
  • Experience with infrastructure automation and distributed systems design
  • Experience with Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers

Benefits For Senior Site Reliability Engineer - DGX Cloud

Medical Insurance · Equity
  • Base salary
  • Equity
  • Benefits package
