NVIDIA, the world leader in accelerated computing, is seeking a Senior Datacenter Resiliency (RAS) Architect to join its innovative team. This role combines hardware and software architecture expertise to enhance the reliability and performance of NVIDIA's cutting-edge datacenter GPUs and SoCs. The position focuses on developing resilient computing solutions for AI and high-performance computing applications.
The role requires deep technical expertise in computer architecture, particularly in GPU systems and reliability engineering. You'll be responsible for architecting hardware and software features that improve system reliability, analyzing complex metrics, and developing comprehensive verification systems. This position offers the opportunity to work with state-of-the-art technology and directly impact the future of AI computing infrastructure.
The ideal candidate will bring a strong academic background (Master's or PhD) in Computer or Electrical Engineering, combined with 5+ years of relevant experience. Expertise in GPU architecture and RAS features, along with programming skills in Python, C++, and CUDA, is essential. The position offers competitive compensation ranging from $184,000 to $356,500, plus equity and benefits.
This is an exciting opportunity to join NVIDIA's Accelerated and Resilient Compute Systems team, working at the intersection of hardware and software to build reliable, high-performance computing platforms. The role is perfect for someone passionate about pushing the boundaries of computing technology and contributing to the advancement of AI and HPC infrastructure.