Senior Site Reliability Engineer - GPU Cloud

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twin technology.
Site Reliability
Senior Software Engineer
In-Person
5,000+ Employees
8+ years of experience
AI · Enterprise SaaS · Cloud

Job Description

NVIDIA, a pioneer in accelerated computing, is seeking a Senior Site Reliability Engineer for its GPU Cloud team. This role is part of a dynamic SRE team managing cloud and on-prem infrastructure for High-Performance & Distributed Computing. The position involves working with cutting-edge technology in AI, LLMs, and cloud computing, supporting NVIDIA's GPU cloud platform that serves both internal R&D teams and external AI/ML customers.

The role requires managing infrastructure spanning thousands of GPU nodes, focusing on automation, monitoring, and analytics solutions. You'll be responsible for the complete lifecycle of tools and services, from design to deployment, while ensuring high reliability and availability of the platform. This is an opportunity to work with state-of-the-art technology while solving complex infrastructure challenges.

The ideal candidate will bring 8+ years of experience in large-scale distributed systems, strong programming skills, and expertise in cloud infrastructure and Kubernetes. You'll be joining a company at the forefront of AI and accelerated computing innovation, working on technology that's transforming major industries.

NVIDIA offers a collaborative environment where creativity and autonomy are valued. The company is committed to diversity and inclusion, providing equal opportunities to all qualified candidates. This role offers the chance to work with some of the most forward-thinking professionals in the technology industry while contributing to groundbreaking innovations in AI, cloud computing, and high-performance computing.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineer - GPU Cloud

  • Provide scalable and robust service-oriented infrastructure automation, monitoring, and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure
  • Own the whole life cycle of new tools and services, from requirements gathering and design documentation to validation and deployment
  • Provide customer support on a rotation basis

Requirements For Senior Site Reliability Engineer - GPU Cloud

Go
Python
Kubernetes
  • Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments
  • Proficiency in at least one programming language: Go, Python, Perl, C++, Java, or C
  • Strong command of Terraform, Kubernetes, and cloud infrastructure administration
  • Excellent debugging and troubleshooting skills
  • Ability to design simple and reliable systems
  • Outstanding teammate who can collaborate and influence in a multifaceted environment
  • Excellent interpersonal and written communication skills
  • M.Sc. or B.E. in Computer Science or a related technical field
