Principal Engineer, Server RAS

NVIDIA is the world leader in accelerated computing, pioneering solutions for challenges no one else can solve.
$272,000 - $419,750
Principal Software Engineer
Hybrid
15+ years of experience
AI · Enterprise SaaS

Description For Principal Engineer, Server RAS

NVIDIA, known as "the AI computing company," is seeking a talented and experienced RAS (Reliability, Availability, and Serviceability) Architect for its DGX, HGX, and MGX systems, which deliver world-leading solutions for enterprise AI infrastructure at scale. The role involves designing, architecting, and implementing robust RAS features to improve the reliability of NVIDIA GPU and Grace systems.

Key responsibilities include:

  • Designing server-level RAS for NVIDIA's data center products
  • Defining RAS requirements for scale-out environments
  • Developing fault detection, isolation, and recovery mechanisms
  • Designing redundancy and fault-tolerant mechanisms
  • Collaborating with customers, vendors, and suppliers
  • Conducting system-level and cluster-level simulations and analysis
  • Staying updated with the latest RAS techniques and industry trends
  • Working with NVIDIA partners on RAS-related architecture
  • Contributing to all phases of product development

The ideal candidate should have:

  • BS, MS, or PhD in EE/CS or related field with 15+ years of experience
  • Strong programming skills in C/C++ and Linux environments
  • Expertise in system-level architecture design and reliability engineering
  • Proficiency in scale-out architectures
  • Experience with fault-tolerant design principles
  • Skills in system-level simulation tools and methodologies
  • Excellent problem-solving and communication skills

This role offers the opportunity to work at the forefront of technological advancement, contributing to the next generation of computing. NVIDIA provides a competitive base salary range of $272,000 - $419,750 USD, along with equity and comprehensive benefits.

Last updated 8 months ago

Responsibilities For Principal Engineer, Server RAS

  • Design, architect, and deliver server-level RAS for NVIDIA's data center products
  • Define RAS requirements for scale-out environments
  • Develop fault detection, isolation, and recovery mechanisms
  • Design redundancy and fault-tolerant mechanisms
  • Collaborate with customers, vendors, and suppliers
  • Conduct system-level and cluster-level simulations, analysis, and testing
  • Stay up to date with the latest advancements in RAS techniques
  • Work with NVIDIA partners on RAS-related architecture
  • Contribute to all phases of product development

Requirements For Principal Engineer, Server RAS

  • BS, MS, or PhD in EE/CS or related field with 15+ years of experience
  • Strong C/C++ programming skills in a Linux environment
  • Strong understanding of Linux kernel internals
  • Strong code review skills
  • Expertise in system-level architecture design and reliability engineering
  • Proficiency in scale-out architectures
  • Experience with fault-tolerant design principles
  • Proficiency in system-level simulation tools and methodologies
  • Excellent problem-solving skills and attention to detail
  • Excellent written and oral communication skills

Benefits For Principal Engineer, Server RAS

  • Equity
