NVIDIA, known as "the AI computing company," is seeking a talented and experienced RAS (Reliability, Availability, and Serviceability) Architect for its DGX, HGX, and MGX systems, which deliver world-leading solutions for enterprise AI infrastructure at scale. The role involves designing, architecting, and implementing robust RAS features to improve the reliability of NVIDIA GPU and Grace systems.
Key responsibilities include:
- Designing server-level RAS for NVIDIA's data center products
- Defining RAS requirements for scale-out environments
- Developing fault detection, isolation, and recovery mechanisms
- Designing redundancy and fault-tolerant mechanisms
- Collaborating with customers, vendors, and suppliers
- Conducting system-level and cluster-level simulations and analysis
- Staying current with the latest RAS techniques and industry trends
- Working with NVIDIA partners on RAS-related architecture
- Contributing to all phases of product development
The ideal candidate should have:
- BS, MS, or PhD in EE/CS or a related field, with 15+ years of experience
- Strong C/C++ programming skills and experience in Linux environments
- Expertise in system-level architecture design and reliability engineering
- Proficiency in scale-out architectures
- Experience with fault-tolerant design principles
- Proficiency with system-level simulation tools and methodologies
- Excellent problem-solving and communication skills
This role offers the opportunity to work at the forefront of technological advancement, contributing to the next generation of computing. NVIDIA offers a competitive base salary range of $272,000 to $419,750 USD, plus equity and comprehensive benefits.