Senior Site Reliability Engineer - AI Research Clusters

World leader in accelerated computing, pioneering AI and digital twin technology.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is seeking a Senior Site Reliability Engineer to join their GPU AI/HPC Infrastructure team. This role focuses on designing and implementing cutting-edge GPU compute clusters that power AI research across NVIDIA. The position requires expertise in building and operating high-reliability, efficient, and high-performance clusters while driving automation improvements to enhance researcher productivity.

As an SRE at NVIDIA, you'll take a holistic view of how systems interact, using a range of tools and approaches to address complex challenges. The role emphasizes limiting reactive operational work, conducting blameless postmortems, and proactively identifying potential issues. NVIDIA's SRE culture values diversity, intellectual curiosity, problem-solving, and openness.

The role involves working with large-scale GPU-accelerated computing environments, developing automation solutions, and maintaining AI-HPC GPU clusters. You'll support researchers in optimizing their deep learning workflows and design systems focusing on performance at scale, real-time monitoring, logging, and alerting.

NVIDIA has grown from its origins in PC gaming into a leader in artificial intelligence and parallel computing. The company's GPUs are now central to AI research worldwide, which depends on massively parallel computation. This position offers the opportunity to work with cutting-edge technology while making a lasting impact on AI research and development.

The ideal candidate will bring extensive experience in site reliability engineering for high-performance computing, a deep understanding of GPU computing, and proven expertise in cluster management and automation. Additional valuable skills include experience with NVIDIA GPUs, CUDA programming, cloud deployment, and distributed storage systems.

This is a hybrid position available across multiple locations in India, offering the opportunity to work with a diverse, supportive team while contributing to groundbreaking developments in AI infrastructure.

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot, diagnose, and root-cause system failures
  • Scale systems through automation
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Operational experience with a cluster of at least 2,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers (Slurm)
  • Knowledge of cluster configuration management tools (BCM, Ansible)
  • Understanding of container technologies (Docker, Enroot)
  • Programming experience in Python and Bash
