Senior System Software Engineer, Distributed Systems - DGX Cloud

NVIDIA is the world leader in accelerated computing, pioneering solutions for challenges no one else can solve.
$148,000 - $230,000
Distributed Systems
Senior Software Engineer
Hybrid
6+ years of experience
AI · Enterprise SaaS · Cloud

Description For Senior System Software Engineer, Distributed Systems - DGX Cloud

NVIDIA is hiring engineers to scale up its AI infrastructure for the DGX Cloud project. The role involves designing and architecting a platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. Key responsibilities include:

  • Developing, testing, and optimizing solutions for data center firmware throughout its lifecycle
  • Collaborating with hardware, software, infrastructure, and business teams to implement new firmware features
  • Defining server-level reliability, availability, and serviceability requirements
  • Working on fault-resilient solutions at scale
  • Ensuring seamless integration of software from hardware to AI training applications

The ideal candidate should have:

  • BS, MS, or PhD in EE/CS or related field with 6+ years of experience in Python development on Linux
  • Strong communication skills and ability to work with multi-functional teams
  • Familiarity with industry standards like SPI, I2C, PCIe, UEFI, and PLDM
  • Expert knowledge of systems programming languages (Go, Python) and understanding of data structures and algorithms
  • Experience with distributed systems, including performance, security, and reliability aspects

Additional valuable skills include:

  • Understanding of machine check architecture and error flows
  • Familiarity with Linux server design, x86/ARM architecture, and various interconnects
  • Experience in designing and maintaining cloud AI infrastructure

NVIDIA offers a base salary range of $148,000 - $230,000 USD, along with equity and benefits. NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.

Last updated 7 days ago

Responsibilities For Senior System Software Engineer, Distributed Systems - DGX Cloud

  • Design and architect a platform for GPU asset provisioning and management
  • Develop, test, and optimize data center firmware solutions
  • Collaborate with cross-functional teams on new firmware features
  • Define server-level reliability, availability, and serviceability requirements
  • Work on fault-resilient solutions at scale
  • Ensure seamless integration of software from hardware to AI applications

Requirements For Senior System Software Engineer, Distributed Systems - DGX Cloud

Python · Go · Linux
  • BS, MS, or PhD in EE/CS or related field (or equivalent experience)
  • 6+ years of experience in Python development on Linux
  • Strong communication skills
  • Familiarity with industry standards (SPI, I2C, PCIe, UEFI, PLDM)
  • Expert knowledge of systems programming languages (Go, Python)
  • Understanding of data structures and algorithms
  • Experience with distributed systems

