Senior Datacenter Resiliency Architect

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.
$184,000 - $356,500
Backend
Staff Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Description For Senior Datacenter Resiliency Architect

NVIDIA, the world leader in accelerated computing, is seeking a Senior Datacenter Resiliency (RAS) Architect to join their innovative team. This role sits at the intersection of hardware and software development, focusing on building resilient systems for AI and high-performance computing.

The position involves architecting critical resiliency features for NVIDIA's industry-leading Datacenter GPUs and SoCs, which power artificial intelligence and high-performance computing applications. You'll be responsible for modeling and analyzing RAS metrics, developing verification testplans, and creating CUDA software diagnostics for GPU clusters.

As a Senior Datacenter Resiliency Architect, you'll collaborate with cross-functional teams including architects, unit designers, and software engineers. The role requires a deep understanding of GPU architecture, computer systems, and a strong foundation in programming with languages like C++, Python, and CUDA.

The position offers an exciting opportunity to work at the forefront of AI computing technology. NVIDIA's invention of the GPU in 1999 revolutionized parallel computing, and more recently, its GPU deep learning innovations have ignited modern AI, powering AI factories, robots, and self-driving cars.

The compensation package is highly competitive, with a base salary range of $184,000 to $356,500, plus equity and comprehensive benefits. This is a chance to join a company that's driving innovation in AI and digital twins, transforming major industries and making a significant impact on society.

The ideal candidate will have 5+ years of relevant experience, an advanced degree in Computer or Electrical Engineering, and expertise in GPU architecture or RAS features. Strong programming skills, analytical capabilities, and excellent collaboration abilities are essential for success in this role.

Working at NVIDIA means joining a learning organization that constantly evolves and tackles exciting challenges that matter to the world. The company culture emphasizes innovation, pushing boundaries, and working at the highest level of one's craft.

Last updated a month ago

Responsibilities For Senior Datacenter Resiliency Architect

  • Architect hardware and software resiliency features to improve system Reliability, Availability, and Serviceability (RAS)
  • Model and analyze RAS metrics such as Failures in Time (FIT) for permanent and transient errors
  • Collaborate with architects, unit designers and software engineers
  • Develop and implement comprehensive architecture verification testplans
  • Execute the architecture testplan and debug tests
  • Run simulations to analyze Architectural Vulnerability Factor
  • Develop CUDA software diagnostics kernels
  • Develop and automate fault models

Requirements For Senior Datacenter Resiliency Architect

  • Master's or PhD degree in Computer Engineering, Electrical Engineering or related field
  • 5+ years of relevant experience
  • Familiarity with GPU and Networking Architectures
  • Strong knowledge in GPU hardware architecture or RAS features
  • Proficiency in developing Architecture models
  • Scripting and automation with Python or similar
  • Proficiency in C/C++
  • Excellent interpersonal skills
  • Strong debugging and analytical skills

Benefits For Senior Datacenter Resiliency Architect

  • Equity
  • Medical Benefits
