NVIDIA is seeking a talented and experienced Senior Software Engineer for Server Manageability FMEA (Failure Mode and Effects Analysis). This role is crucial in improving the reliability of NVIDIA GPU and Grace systems through comprehensive failure analysis and architecting fault-resilient software and firmware.
Key Responsibilities:
- Design and develop server-level FMEA for NVIDIA's Data Center products
- Define server-level reliability, availability, and serviceability requirements
- Collaborate with cross-functional teams to identify potential failure points and propose mitigation strategies
- Develop fault detection, isolation, and recovery mechanisms
- Design redundancy and fault-tolerant mechanisms
- Evaluate and select appropriate technologies to optimize RAS (Reliability, Availability, and Serviceability)
- Conduct system- and cluster-level simulations and testing
- Stay updated with the latest advancements in RAS techniques and industry trends
- Work with NVIDIA partners on RAS-related architecture discussions
Required Qualifications:
- BS, MS, or PhD in EE/CS or a related field, with 12+ years of experience
- Strong programming skills in C/C++ in a Linux environment
- Expertise in system-level architecture design and reliability engineering
- Proficiency in scale-out architectures and fault-tolerant design principles
- Experience with system-level simulation tools and methodologies
- Excellent problem-solving skills and attention to detail
The ideal candidate will have a proven record of conducting system-level FMEA, familiarity with machine check architecture, and hands-on experience with x86 or ARM system architecture.
NVIDIA offers a competitive base salary range of $220,000 - $419,750 USD, along with equity and comprehensive benefits. Join us at the forefront of technological advancement and help shape the future of computing!