NVIDIA is seeking an AI & HPC Clusters group manager to join its Cloud Solutions group. In this role, you will lead a team responsible for building, managing, and maintaining the largest cluster in NVIDIA Networking R&D, which is used to validate and test next-generation networking cloud technology and reference architecture. You'll work on next-generation Blackwell GPU platform AI clouds with XDR (800G InfiniBand) and Spectrum-X800 technology.
Key responsibilities include:
- Leading a group managing SW R&D clusters with various systems and technologies
- Collaborating with engineering and architecture teams to plan and build new clusters
- Driving the design and implementation of automated systems for cluster management
- Implementing resource management systems for multiuser environments
- Managing the R&D lab, including inventory, power, space, and cooling
- Building and mentoring the team to address growing demands
- Innovating and influencing NVIDIA Networking cluster management tools
Requirements:
- Degree in Computer Science, Engineering, or related field
- 5+ years of managerial experience, including managing managers
- 10+ years of relevant professional experience
- Experience with data center management and HPC/AI clusters
- Deep understanding of operating systems, computer networks, and high-performance hardware
- Knowledge of distributed resource scheduling systems and orchestration tools
- Strong organizational and project management skills
Preferred qualifications:
- Knowledge of HPC and AI solution technologies
- Familiarity with CUDA and managing GPU-accelerated computing systems
- Experience with InfiniBand
NVIDIA offers a diverse and supportive work environment that fosters innovation and creativity. NVIDIA is an equal-opportunity employer committed to diversity and inclusion.