NVIDIA, the world leader in accelerated computing, is seeking a Senior Software Engineer to spearhead AI software resiliency development for the world's most powerful AI supercomputers. This role is crucial to the AI Software Resiliency team and focuses on implementing critical features for AI systems operating at a scale of 100,000+ GPUs. The position involves developing sophisticated resiliency features, optimizing system performance, and ensuring near-zero downtime for massive AI clusters.
The role combines deep technical expertise in distributed systems, AI frameworks, and high-performance computing. You'll work on implementing essential features like fast checkpoint-recovery, error detection, and straggler/hang detection. The position requires strong coding skills in C++ and Python, with a focus on production-level code that can handle AI workloads across thousands of GPUs.
Working at NVIDIA means joining a team of world-class engineers tackling the hardest challenges in AI infrastructure. You'll collaborate with AI researchers and various hardware/software teams, contributing directly to making AI training and inference more reliable, scalable, and efficient. The company offers competitive compensation, including equity, and the opportunity to work on cutting-edge technology that's transforming industries.
The ideal candidate brings 6+ years of experience, strong distributed systems knowledge, and familiarity with AI frameworks. Experience with CUDA, NCCL, or MPI for GPU-accelerated computing at extreme scale is highly valued. This role offers the chance to shape the future of AI computing while working with state-of-the-art technology and brilliant colleagues.