Senior Site Reliability Engineer - AI Research Clusters

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.

Bengaluru, Karnataka, India • Hyderabad, Telangana, India • Pune, Maharashtra, India…

Site Reliability

Senior Software Engineer

Hybrid

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA, a global leader in accelerated computing and GPU technology, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. The role works on the GPU compute clusters that power AI research across NVIDIA. As an SRE, you'll be responsible for designing, implementing, and maintaining high-performance computing environments, with a focus on reliability, efficiency, and performance optimization.

The position involves working with cutting-edge technology in AI and GPU computing, where you'll be part of a diverse and collaborative team that values intellectual curiosity and problem-solving. You'll be building and improving the ecosystem around GPU-accelerated computing, developing large-scale automation solutions, and supporting researchers in optimizing their deep learning workflows.

Key responsibilities include designing state-of-the-art GPU compute clusters, implementing automation for enhanced productivity, and ensuring system reliability through proactive monitoring and incident response. You'll work with advanced technologies including Kubernetes, container platforms, and high-performance computing schedulers.

The ideal candidate brings 5+ years of experience in large-scale infrastructure operations, strong expertise in Python programming, and a deep understanding of GPU computing and AI infrastructure. This role offers the opportunity to make a lasting impact on NVIDIA's AI research capabilities while working in a supportive environment that promotes learning and growth.

Join NVIDIA's team of innovators who are pushing the boundaries of technology and transforming industries through GPU computing and artificial intelligence. This position offers the chance to work on some of the largest and most complex systems in the world while contributing to groundbreaking advancements in AI research infrastructure.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

Design and implement state-of-the-art GPU compute clusters
Optimize cluster operations for maximum reliability, efficiency, and performance
Drive foundational improvements and automation to enhance researcher productivity
Troubleshoot and diagnose system failures
Scale systems through automation
Participate in on-call rotation to support production systems
Write and review code, develop documentation and capacity plans
Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python

Kubernetes

Linux

Bachelor's degree in Computer Science, Electrical Engineering, or a related field
5+ years of experience designing and operating large-scale compute infrastructure
Operational experience with a cluster of at least 2,000 GPUs
Deep understanding of GPU computing and AI infrastructure
Experience with advanced AI/HPC job schedulers such as Slurm
Knowledge of cluster configuration management tools (BCM, Ansible)
Experience with container technologies such as Docker and Enroot
Experience with Python programming and Bash scripting