Senior Site Reliability Engineer - AI Research Clusters

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior Site Reliability Engineer - AI Research Clusters

NVIDIA, a global leader in accelerated computing and GPU technology, is seeking a Senior Site Reliability Engineer to join its GPU AI/HPC Infrastructure team. The role involves working on the GPU compute clusters that power AI research across NVIDIA. As an SRE, you'll be responsible for designing, implementing, and maintaining high-performance computing environments, with a focus on reliability, efficiency, and performance optimization.

The position involves working with cutting-edge technology in AI and GPU computing, where you'll be part of a diverse and collaborative team that values intellectual curiosity and problem-solving. You'll be building and improving the ecosystem around GPU-accelerated computing, developing large-scale automation solutions, and supporting researchers in optimizing their deep learning workflows.

Key responsibilities include designing state-of-the-art GPU compute clusters, implementing automation for enhanced productivity, and ensuring system reliability through proactive monitoring and incident response. You'll work with advanced technologies including Kubernetes, container platforms, and high-performance computing schedulers.

The ideal candidate brings 5+ years of experience in large-scale infrastructure operations, strong expertise in Python programming, and deep understanding of GPU computing and AI infrastructure. This role offers the opportunity to make a lasting impact on NVIDIA's AI research capabilities while working in a supportive environment that promotes learning and growth.

Join NVIDIA's team of innovators who are pushing the boundaries of technology and transforming industries through GPU computing and artificial intelligence. This position offers the chance to work on some of the largest and most complex systems in the world while contributing to groundbreaking advancements in AI research infrastructure.

Last updated a month ago

Responsibilities For Senior Site Reliability Engineer - AI Research Clusters

  • Design and implement state-of-the-art GPU compute clusters
  • Optimize cluster operations for maximum reliability, efficiency, and performance
  • Drive foundational improvements and automation to enhance researcher productivity
  • Troubleshoot and diagnose system failures
  • Scale systems through automation
  • Participate in on-call rotation to support production systems
  • Write and review code, develop documentation and capacity plans
  • Manage upgrades and automated rollbacks across all clusters

Requirements For Senior Site Reliability Engineer - AI Research Clusters

Python · Kubernetes · Linux
  • Bachelor's degree in Computer Science, Electrical Engineering, or a related field
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Operational experience with a cluster of at least 2,000 GPUs
  • Deep understanding of GPU computing and AI infrastructure
  • Experience with advanced AI/HPC job schedulers such as Slurm
  • Knowledge of cluster configuration management tools (BCM, Ansible)
  • Experience with container technologies such as Docker and Enroot
  • Experience programming in Python and Bash scripting
