Senior Software Engineer, AI Resiliency

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins technology.
$184,000 - $287,500
Senior Software Engineer
In-Person
5,000+ Employees
6+ years of experience
AI

Description For Senior Software Engineer, AI Resiliency

NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to spearhead AI software resiliency development for the world's most powerful AI supercomputers. As a member of the AI Software Resiliency team, you will implement critical features for AI systems operating at a scale of 100,000+ GPUs: developing sophisticated resiliency features, optimizing system performance, and ensuring near-zero downtime for massive AI clusters.

The role combines deep technical expertise in distributed systems, AI frameworks, and high-performance computing. You'll work on implementing essential features like fast checkpoint-recovery, error detection, and straggler/hang detection. The position requires strong coding skills in C++ and Python, with a focus on production-level code that can handle AI workloads across thousands of GPUs.

Working at NVIDIA means joining a team of world-class engineers tackling the hardest challenges in AI infrastructure. You'll collaborate with AI researchers and various hardware/software teams, contributing directly to making AI training and inference more reliable, scalable, and efficient. The company offers competitive compensation, including equity, and the opportunity to work on cutting-edge technology that's transforming industries.

The ideal candidate brings 6+ years of experience, strong distributed systems knowledge, and familiarity with AI frameworks. Experience with CUDA, NCCL, or MPI for GPU-accelerated computing at extreme scale is highly valued. This role offers the chance to impact the future of AI computing while working with state-of-the-art technology and brilliant colleagues.

Last updated 2 days ago

Responsibilities For Senior Software Engineer, AI Resiliency

  • Implement and optimize software features for AI system reliability at massive scale
  • Contribute to large-scale distributed systems with C++ and Python code
  • Work on AI system error handling and failure detection
  • Collaborate with teams to integrate resiliency features into AI frameworks
  • Develop and implement tests for robustness and scalability
  • Debug and performance-tune large-scale AI workloads

Requirements For Senior Software Engineer, AI Resiliency

  • Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field
  • Proficiency in C++ and Python
  • 6+ years of relevant experience
  • Strong understanding of distributed systems concepts
  • Familiarity with AI frameworks such as PyTorch, JAX/XLA, or TensorFlow
  • Experience with debugging and profiling tools
  • Excellent problem-solving skills

Benefits For Senior Software Engineer, AI Resiliency

  • Equity