
Senior Site Reliability Engineer, DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
Site Reliability
Senior Software Engineer
Remote
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing and AI technology for over 25 years, is seeking a Senior Site Reliability Engineer for its DGX Cloud initiative. The role is central to delivering a fully managed AI platform across major cloud providers, one that optimizes AI workloads using high-performance NVIDIA infrastructure. It involves managing large-scale Kubernetes clusters, ensuring system reliability, and maintaining high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide. The ideal candidate has extensive experience in SRE practices, Kubernetes administration, and cloud platforms. The position is remote and offers the chance to work with cutting-edge AI infrastructure and make a significant impact on NVIDIA's cloud infrastructure. As an NVIDIAN, you'll join a diverse, supportive environment where innovation and technical excellence are paramount.

Last updated 3 days ago

Responsibilities For Senior Site Reliability Engineer, DGX Cloud

  • Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services through system design consulting and launch reviews
  • Maintain services by measuring and monitoring availability, latency and system health
  • Operate and optimize GPU workloads across multiple cloud platforms
  • Scale systems through automation
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in on-call rotation

Requirements For Senior Site Reliability Engineer, DGX Cloud

Kubernetes · Python · Go · Linux
  • BS in Computer Science or related technical field, or equivalent experience
  • 10+ years of experience operating production services
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools
  • Proficiency in at least one high-level programming language
  • In-depth knowledge of Linux operating systems, networking fundamentals, and cloud security standards
  • Strong working knowledge of SRE principles
  • Experience building and operating comprehensive observability stacks
