Senior AI-HPC Cluster Engineer - MLOps

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.
$184,000 - $356,500
DevOps
Senior Software Engineer
Hybrid
5,000+ Employees
6+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, a pioneer in accelerated computing and GPU technology, is seeking a Senior AI-HPC Cluster Engineer to join its MLOps team. This role sits at the intersection of high-performance computing and artificial intelligence, focusing on building and optimizing large-scale GPU compute clusters for deep learning and HPC workloads.

The position offers an opportunity to work with cutting-edge technology at a company that has been transforming computer graphics and accelerated computing for over 25 years. As an NVIDIAN, you'll be responsible for designing and implementing GPU compute clusters, developing scalable automation solutions, and supporting researchers in optimizing their workloads.

The ideal candidate brings 6+ years of experience in large-scale compute infrastructure, strong programming skills in both scripting (Python, Bash) and compiled languages (Go, Rust, C++), and deep expertise in Linux systems and container technologies. Knowledge of AI/HPC job schedulers, performance tuning, and distributed systems is essential.

This role offers competitive compensation with a base salary ranging from $184,000 to $356,500 USD (depending on level), plus equity and comprehensive benefits. You'll be joining a diverse, supportive environment where innovation is celebrated and you can make a lasting impact on the world of AI and high-performance computing.

The position offers flexibility with locations in major tech hubs like Santa Clara, CA and Austin, TX, with hybrid work options available. You'll be working with state-of-the-art technology including NVIDIA GPUs, CUDA programming, and modern ML frameworks, while collaborating with talented researchers and engineers to push the boundaries of what's possible in AI and HPC.

Last updated 11 hours ago

Responsibilities For Senior AI-HPC Cluster Engineer - MLOps

  • Provide leadership and strategic mentorship on the management of large-scale HPC systems
  • Develop and improve the GPU-accelerated computing ecosystem, including scalable automation solutions
  • Build and maintain customer and cross-team relationships
  • Support researchers in running workloads, including performance analysis and optimization
  • Conduct root cause analysis and suggest corrective actions
  • Build innovative tooling to accelerate researchers' velocity and software performance

Requirements For Senior AI-HPC Cluster Engineer - MLOps

Python · Kubernetes · Linux · Go · Rust
  • Bachelor's degree in Computer Science, Electrical Engineering or related field
  • 6+ years of experience with large-scale compute infrastructure
  • Experience with AI/HPC job schedulers (Slurm, K8s, LSF)
  • Proficiency in Linux (CentOS/RHEL/Ubuntu)
  • Knowledge of container technologies (Enroot, Docker, Podman)
  • Proficiency in Python/Bash and Go/Rust/C/C++
  • Experience analyzing and tuning AI/HPC workloads
  • Excellent problem-solving and communication skills

Benefits For Senior AI-HPC Cluster Engineer - MLOps

  • Equity
  • Medical Insurance

Related Jobs

Senior Software Release Engineer, Holoscan

Senior Software Release Engineer position at NVIDIA's Holoscan team, focusing on build automation, release management, and DevOps practices with competitive compensation and hybrid work options.

Senior System Software Engineer - DevOps and Infrastructure Automation

Senior DevOps Engineer role at NVIDIA focusing on AI infrastructure automation and CI/CD pipeline management, offering competitive compensation and the opportunity to work with cutting-edge technology.

Senior Software Engineer - Bare Metal DevOps

Senior Software Engineer position at NVIDIA focusing on Bare Metal DevOps, managing infrastructure and developing solutions for AI workloads using Kubernetes, Rust, Go, and Python.

Senior Software Engineer - Bare Metal DevOps

Senior Software Engineer role at NVIDIA focusing on Bare Metal DevOps, managing infrastructure and Kubernetes clusters for AI workloads.

SWQA Tools Development Engineer

Senior SWQA Tools Development Engineer position at NVIDIA, focusing on certification testing and automation tool development using AI/ML technologies.