
Senior Software Engineer, Server Manageability FMEA

NVIDIA is the world leader in accelerated computing, pioneering solutions for AI and digital twins that transform industries and society.
$220,000 - $419,750
Backend
Senior Software Engineer
Hybrid
5,000+ Employees
12+ years of experience
AI · Enterprise SaaS
This job posting may no longer be active.

Description For Senior Software Engineer, Server Manageability FMEA

NVIDIA is seeking a talented and experienced Senior Software Engineer for Server Manageability FMEA (Failure Mode and Effects Analysis). This role is crucial in improving the reliability of NVIDIA GPU and Grace systems through comprehensive failure analysis and architecting fault-resilient software and firmware.

Key Responsibilities:

  • Design and develop server-level FMEA for NVIDIA's Data Center products
  • Define server-level reliability, availability, and serviceability requirements
  • Collaborate with cross-functional teams to identify potential failure points and propose mitigation strategies
  • Develop fault detection, isolation, and recovery mechanisms
  • Design redundancy and fault-tolerant mechanisms
  • Evaluate and select appropriate technologies to optimize RAS (Reliability, Availability, and Serviceability)
  • Conduct system- and cluster-level simulations and testing
  • Stay updated with the latest advancements in RAS techniques and industry trends
  • Work with NVIDIA partners on RAS-related architecture discussions

Required Qualifications:

  • BS, MS, or PhD in EE/CS or related field with 12+ years of experience
  • Strong programming skills in C/C++ in a Linux environment
  • Expertise in system-level architecture design and reliability engineering
  • Proficiency in scale-out architectures and fault-tolerant design principles
  • Experience with system-level simulation tools and methodologies
  • Excellent problem-solving skills and attention to detail

The ideal candidate will have a proven record of conducting system-level FMEA, familiarity with machine check architecture, and hands-on experience with x86 or ARM system architecture.

NVIDIA offers a competitive base salary range of $220,000 - $419,750 USD, along with equity and comprehensive benefits. Join us at the forefront of technological advancement and help shape the future of computing!

Last updated 8 months ago

Responsibilities For Senior Software Engineer, Server Manageability FMEA

  • Design and develop server-level FMEA for NVIDIA's Data Center products
  • Define server-level reliability, availability, and serviceability requirements
  • Collaborate with cross-functional teams to identify potential points of failure
  • Develop fault detection, isolation, and recovery mechanisms
  • Design redundancy and fault-tolerant mechanisms
  • Evaluate and select appropriate technologies to optimize RAS
  • Conduct system- and cluster-level simulations and testing
  • Stay up to date with the latest advancements in RAS techniques
  • Work with NVIDIA partners on RAS-related architecture discussions

Requirements For Senior Software Engineer, Server Manageability FMEA

  • BS, MS, or PhD in EE/CS or related field with 12+ years of experience
  • Strong programming skills in C/C++ in a Linux environment
  • Strong understanding of Linux kernel internals
  • Expertise in system-level architecture design and reliability engineering
  • Proficiency in scale-out architectures
  • Experience with fault-tolerant design principles
  • Proficiency in system-level simulation tools and methodologies
  • Excellent problem-solving skills and attention to detail
  • Excellent written and oral communication skills

Benefits For Senior Software Engineer, Server Manageability FMEA

  • Equity
  • Comprehensive benefits package
