Senior Site Reliability Engineer - GPU Cloud

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.

Bengaluru, Karnataka, India

Site Reliability

Senior Software Engineer

In-Person

5,000+ Employees

8+ years of experience

AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing, is seeking a Senior Site Reliability Engineer for its GPU Cloud team. This role is part of a dynamic SRE team managing cloud and on-prem infrastructure for high-performance and distributed computing. The position involves working with cutting-edge technology in AI, LLMs, and cloud computing, supporting NVIDIA's GPU cloud platform that serves both internal R&D teams and external AI/ML customers.

The role requires managing infrastructure spanning thousands of GPU nodes, focusing on automation, monitoring, and analytics solutions. You'll be responsible for the complete lifecycle of tools and services, from design to deployment, while ensuring high reliability and availability of the platform. This is an opportunity to work with state-of-the-art technology while solving complex infrastructure challenges.

The ideal candidate will bring 8+ years of experience in large-scale distributed systems, strong programming skills, and expertise in cloud infrastructure and Kubernetes. You'll be joining a company at the forefront of AI and accelerated computing innovation, working on technology that's transforming major industries.

NVIDIA offers a collaborative environment where creativity and autonomy are valued. The company is committed to diversity and inclusion, providing equal opportunities to all qualified candidates. This role offers the chance to work with some of the most forward-thinking professionals in the technology industry while contributing to groundbreaking innovations in AI, cloud computing, and high-performance computing.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineer - GPU Cloud

Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure
Own the whole life cycle of new tools and services, from requirements gathering through design documentation, validation, and deployment
Provide customer support on a rotation basis

Requirements For Senior Site Reliability Engineer - GPU Cloud

Python

Kubernetes

Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments
Proficiency in at least one of Go, Python, Perl, C++, Java, or C
Strong command of Terraform, Kubernetes, and cloud infrastructure administration
Excellent debugging and troubleshooting skills
Ability to design simple and reliable systems
Outstanding teammate who can collaborate and influence in a multifaceted environment
Excellent interpersonal and written communication skills
M.Sc. or B.E. in Computer Science or a related technical field
