
AI and HPC Cluster Group Manager

NVIDIA is the world leader in accelerated computing, pioneering AI and digital twins to transform industries and society.
Cloud
Staff Software Engineer
In-Person
5,000+ Employees
10+ years of experience
AI · Enterprise SaaS
This job posting is no longer active.

Job Description

NVIDIA is seeking an AI and HPC Cluster Group Manager to join their Cloud Solutions group. In this role, you will lead a team responsible for building, managing, and maintaining the largest cluster in NVIDIA Networking R&D, which is used to validate and test next-generation cloud networking technology and reference architectures. You'll work on next-generation Blackwell GPU platform AI clouds with XDR (800G InfiniBand) and Spectrum-X800 technology.

Key responsibilities include:

  • Leading a group that manages software R&D clusters spanning various systems and technologies
  • Collaborating with engineering and architecture teams to plan and build new clusters
  • Driving the design and implementation of automated systems for cluster management
  • Implementing resource management systems for multiuser environments
  • Managing the R&D lab, including inventory, power, space, and cooling
  • Building and mentoring the team to address growing demands
  • Innovating and influencing NVIDIA Networking cluster management tools

Requirements:

  • Degree in Computer Science, Engineering, or related field
  • 5+ years of managerial experience, including managing managers
  • 10+ years of relevant professional experience
  • Experience in data center management and HPC/AI clusters
  • Deep understanding of operating systems, computer networks, and high-performance hardware
  • Knowledge of distributed resource scheduling systems and orchestration tools
  • Strong organizational and project management skills

Preferred qualifications:

  • Knowledge of HPC and AI solution technologies
  • Familiarity with CUDA and managing GPU-accelerated computing systems
  • Experience with InfiniBand

NVIDIA offers a diverse and supportive work environment, fostering innovation and creativity. They are an equal-opportunity employer committed to diversity and inclusion.

Last updated 10 months ago

Requirements For AI and HPC Cluster Group Manager

Skills: Linux, Kubernetes