Senior Datacenter Resiliency Architect

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.
$184,000 - $356,500
Backend
Staff Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Datacenter Resiliency Architect

NVIDIA, the world leader in accelerated computing, is seeking a Senior Datacenter Resiliency (RAS) Architect to join its innovative team. This role combines hardware and software architecture expertise to enhance the reliability and performance of NVIDIA's cutting-edge datacenter GPUs and SoCs. The position focuses on developing resilient computing solutions for AI and high-performance computing applications.

The role requires deep technical expertise in computer architecture, particularly in GPU systems and reliability engineering. You'll be responsible for architecting hardware and software features that improve system reliability, analyzing complex metrics, and developing comprehensive verification systems. This position offers the opportunity to work with state-of-the-art technology and directly impact the future of AI computing infrastructure.

The ideal candidate will bring a strong academic background (Master's or PhD) in Computer or Electrical Engineering, combined with 5+ years of relevant experience. Expertise in GPU architecture and RAS features, along with programming skills in Python, C++, and CUDA, is essential. The position offers competitive compensation ranging from $184,000 to $356,500, plus equity and benefits.

This is an exciting opportunity to join NVIDIA's Accelerated and Resilient Compute Systems team, working at the intersection of hardware and software to build reliable, high-performance computing platforms. The role is perfect for someone passionate about pushing the boundaries of computing technology and contributing to the advancement of AI and HPC infrastructure.

Responsibilities For Senior Datacenter Resiliency Architect

  • Architect hardware and software resiliency features to improve system Reliability, Availability, and Serviceability (RAS)
  • Model and analyze RAS metrics such as Failures in Time (FIT) for permanent and transient errors
  • Collaborate with architects, unit designers and software engineers
  • Develop and implement comprehensive architecture verification test plans
  • Execute the architecture test plan and debug tests
  • Run simulations to analyze Architectural Vulnerability Factor
  • Develop CUDA software diagnostics kernels
  • Develop and automate fault models

Requirements For Senior Datacenter Resiliency Architect

  • Master's or PhD degree in Computer Engineering, Electrical Engineering or related field
  • At least 5 years of relevant experience
  • Familiarity with GPU and Networking Architectures
  • Strong knowledge of GPU hardware architecture or RAS features
  • Proficiency in developing architecture models
  • Scripting and automation with Python or similar
  • Proficiency in C/C++
  • Excellent interpersonal skills
  • Strong debugging and analytical skills

Benefits For Senior Datacenter Resiliency Architect

  • Equity
  • Medical Benefits
