Senior Site Reliability Engineer, DGX Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

India

Site Reliability

Senior Software Engineer

Remote

5,000+ Employees

10+ years of experience

AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing and AI technology for over 25 years, is seeking a Senior Site Reliability Engineer for its DGX Cloud initiative. The role is central to delivering a fully managed AI platform across major cloud providers, optimizing AI workloads on high-performance NVIDIA infrastructure. It involves managing large-scale Kubernetes clusters, ensuring system reliability, and maintaining high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide. The ideal candidate has extensive experience in SRE practices, Kubernetes administration, and cloud platforms. The position is remote and offers the chance to work with cutting-edge AI infrastructure and to make a significant impact on NVIDIA's cloud infrastructure. As an NVIDIAN, you'll join a diverse, supportive environment where innovation and technical excellence are paramount.

Last updated 3 days ago

Responsibilities For Senior Site Reliability Engineer, DGX Cloud

Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services through system creation consulting and launch reviews
Maintain services by measuring and monitoring availability, latency and system health
Operate and optimize GPU workloads across multiple cloud platforms
Scale systems through automation
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in on-call rotation

Requirements For Senior Site Reliability Engineer, DGX Cloud

Kubernetes

Python

Linux

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools
Proficiency in at least one high-level programming language
In-depth knowledge of Linux operating systems, networking fundamentals, and cloud security standards
Proficiency in SRE principles
Experience building and operating comprehensive observability stacks
