Senior Software Engineer, Server Manageability FMEA

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering solutions for AI and digital twins that transform industries and society.

Santa Clara, CA, USA

$220,000 - $419,750

Backend

Senior Software Engineer

Hybrid

5,000+ Employees

12+ years of experience

AI · Enterprise SaaS

Description For Senior Software Engineer, Server Manageability FMEA

NVIDIA is seeking a talented and experienced Senior Software Engineer for Server Manageability FMEA (Failure Mode and Effects Analysis). This role is crucial in improving the reliability of NVIDIA GPU and Grace systems through comprehensive failure analysis and architecting fault-resilient software and firmware.

Key Responsibilities:

Design and develop server-level FMEA for NVIDIA's Data Center products
Define server-level reliability, availability, and serviceability requirements
Collaborate with cross-functional teams to identify potential failure points and propose mitigation strategies
Develop fault detection, isolation, and recovery mechanisms
Design redundancy and fault-tolerant mechanisms
Evaluate and select appropriate technologies to optimize RAS (Reliability, Availability, and Serviceability)
Conduct system- and cluster-level simulations and testing
Stay updated with the latest advancements in RAS techniques and industry trends
Work with NVIDIA partners on RAS-related architecture discussions

Required Qualifications:

BS, MS, or PhD in EE/CS or related field with 12+ years of experience
Strong programming skills in C/C++ in a Linux environment
Expertise in system-level architecture design and reliability engineering
Proficiency in scale-out architectures and fault-tolerant design principles
Experience with system-level simulation tools and methodologies
Excellent problem-solving skills and attention to detail

The ideal candidate will have a proven record of conducting system-level FMEA, familiarity with machine check architecture, and hands-on experience with x86 or ARM system architecture.

NVIDIA offers a competitive base salary range of $220,000 - $419,750 USD, along with equity and comprehensive benefits. Join us at the forefront of technological advancement and help shape the future of computing!

Last updated 8 months ago

Responsibilities For Senior Software Engineer, Server Manageability FMEA

Design and develop server-level FMEA for NVIDIA's Data Center products
Define server-level reliability, availability, and serviceability requirements
Collaborate with cross-functional teams to identify potential points of failure
Develop fault detection, isolation, and recovery mechanisms
Design redundancy and fault-tolerant mechanisms
Evaluate and select appropriate technologies to optimize RAS
Conduct system- and cluster-level simulations and testing
Stay up-to-date with the latest advancements in RAS techniques
Work with NVIDIA partners on RAS-related architecture discussions

Requirements For Senior Software Engineer, Server Manageability FMEA

Linux

BS, MS, or PhD in EE/CS or related field with 12+ years of experience
Strong programming skills in C/C++ in a Linux environment
Strong understanding of Linux kernel internals
Expertise in system-level architecture design and reliability engineering
Proficiency in scale-out architectures
Experience with fault-tolerant design principles
Proficiency in system-level simulation tools and methodologies
Excellent problem-solving skills and attention to detail
Excellent written and oral communication skills

Benefits For Senior Software Engineer, Server Manageability FMEA

Equity
Comprehensive benefits package