Principal Engineer for AI Software Resiliency

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
$272,000 - $425,500
Machine Learning
Principal Software Engineer
In-Person
10+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Principal Engineer for AI Software Resiliency

NVIDIA is seeking a Principal Software Engineer to lead AI software resiliency development for the world's most powerful AI supercomputers. The role focuses on systems operating at a scale of 100,000+ GPUs and demands expertise in distributed systems and AI infrastructure, combining technical leadership with hands-on development.

The role involves architecting and implementing critical resiliency features for AI supercomputers, including checkpoint-recovery systems, error detection mechanisms, and performance optimization. You'll work directly with major customers and cross-functional teams to integrate these features into frameworks like PyTorch and JAX/XLA.

As a Principal Engineer, you'll lead by example in engineering excellence, fostering innovation while ensuring high code quality and rigorous testing standards. The position requires deep technical expertise combined with strong collaborative skills to work effectively across multiple engineering disciplines.

NVIDIA offers a competitive compensation package, including a base salary range of $272,000-$425,500, plus equity. The company is recognized as one of the world's most desirable technology employers, known for its pioneering work in AI computing and GPU technology. This role presents an exceptional opportunity to impact the future of AI computing infrastructure while working with cutting-edge technology and industry-leading experts.

The ideal candidate will bring extensive experience in distributed systems, AI frameworks, and large-scale infrastructure, along with a passion for developing AI-specific system architectures. This role is perfect for someone who thrives on solving complex technical challenges and wants to be at the forefront of AI technology advancement.

Last updated 4 months ago

Responsibilities For Principal Engineer for AI Software Resiliency

  • Serve as a trusted authority on AI software resiliency
  • Lead execution and development of software resiliency features
  • Drive engineering excellence and contribute to large software codebases
  • Work closely with multiple teams across NVIDIA
  • Collaborate directly with major customers
  • Partner with TPMs, PMs, and QA teams for feature launches

Requirements For Principal Engineer for AI Software Resiliency

  • Python
  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computer Engineering, or related field
  • Minimum 10 years of experience in systems architecture or related fields
  • At least 10 years of hands-on experience in software development for distributed systems
  • 5 years of experience developing AI frameworks such as PyTorch or JAX/XLA
  • Proven track record of working effectively across multiple engineering fields
  • Deep understanding of distributed systems and large-scale AI infrastructure

Benefits For Principal Engineer for AI Software Resiliency

  • Equity
