Senior Site Reliability Engineer, AI Infrastructure

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
12+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, a global leader in accelerated computing and AI technology, is seeking a Senior Site Reliability Engineer to join its AI Infrastructure team. The role combines large-scale system development with AI infrastructure work. As an SRE, you'll maintain and optimize critical systems that power NVIDIA's AI capabilities across global public and private clouds, applying SRE best practices that include incident management, monitoring, and performance optimization.

The ideal candidate will bring 12+ years of experience in Software Development or SRE, along with strong expertise in Python programming and cloud platforms. You'll be working in an environment that values innovation, continuous learning, and technical excellence. The role involves not just technical work but also mentoring peers and contributing to a diverse, high-performing team.

NVIDIA offers a unique opportunity to work at the intersection of SRE and AI, where you'll be handling sophisticated infrastructure that powers some of the most advanced AI systems in the world. The company's culture encourages creativity, autonomy, and forward-thinking, making it one of the technology world's most desirable employers. You'll be part of a team that's defining the next era of computing, working on systems that power computers, robots, and self-driving cars that can understand the world.

This position provides the chance to make a lasting impact while working with advanced technology and outstanding colleagues. The role offers exposure to deep learning frameworks, AI training and inference systems, and distributed systems with stringent SLAs. If you're passionate about reliability engineering and want to be at the forefront of AI infrastructure, this role at NVIDIA is an opportunity to advance your career while contributing to that work.

Last updated 7 days ago

Responsibilities For Senior Site Reliability Engineer, AI Infrastructure

  • Develop and maintain large-scale systems supporting critical use cases for AI Infrastructure
  • Implement SRE fundamentals including incident management, monitoring, and performance optimization
  • Build tools and frameworks to improve observability
  • Establish frameworks for operational maturity
  • Work with engineering teams to deliver innovative solutions and mentor peers

Requirements For Senior Site Reliability Engineer, AI Infrastructure

Python
Linux
Kubernetes
  • Degree in Computer Science or a related field (or equivalent experience), plus 12+ years in Software Development, SRE, or Production Engineering
  • Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby)
  • Expertise in systems engineering within Linux or Windows environments and cloud platforms
  • Strong understanding of SRE principles
  • Hands-on experience with observability platforms and CI/CD systems
  • Strong communication skills
  • Commitment to fostering a culture of diversity, curiosity, and continuous improvement
