NVIDIA is hiring engineers to scale up its AI infrastructure for the DGX Cloud project. The role involves designing a comprehensive platform that automates GPU asset provisioning, configuration, and lifecycle management across cloud providers. Key responsibilities include:
- Developing, testing, and optimizing solutions for datacenter firmware throughout its lifecycle
- Collaborating with hardware, software, infrastructure, and business teams to implement new firmware features
- Defining server-level reliability, availability, and serviceability requirements
- Working on fault-resilient solutions at scale
- Ensuring seamless integration of software from hardware to AI training applications
The ideal candidate should have:
- BS, MS, or PhD in EE/CS or a related field, with 6+ years of experience in Python development on Linux
- Strong communication skills and ability to work with multi-functional teams
- Familiarity with industry standards such as SPI, I2C, PCIe, UEFI, and PLDM
- Expert knowledge of systems programming languages (Go, Python) and understanding of Data Structures and Algorithms
- Experience with distributed systems, including performance, security, and reliability aspects
Additional valuable skills include:
- Understanding of machine check architecture and error flows
- Familiarity with Linux server design, x86/ARM architecture, and various interconnects
- Experience in designing and maintaining cloud AI infrastructure
NVIDIA offers a competitive base salary range of $148,000 - $230,000 USD, along with equity and benefits. The company is committed to fostering a diverse work environment and is an equal opportunity employer.