Senior AI-HPC Cluster Engineer - MLOps

NVIDIA

World leader in accelerated computing, pioneering AI and digital twins technology to transform industries.

Santa Clara, CA, USA • Austin, TX, USA

$184,000 - $356,500

DevOps

Senior Software Engineer

Hybrid

5,000+ Employees

6+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, a pioneer in accelerated computing and GPU technology, is seeking a Senior AI-HPC Cluster Engineer to join its MLOps team. The role sits at the intersection of high-performance computing and artificial intelligence and focuses on building and optimizing large-scale GPU compute clusters for deep learning and HPC workloads.

The position offers an opportunity to work with cutting-edge technology at a company that has been transforming computer graphics and accelerated computing for over 25 years. As an NVIDIAN, you'll be responsible for designing and implementing GPU compute clusters, developing scalable automation solutions, and supporting researchers in optimizing their workloads.

The ideal candidate brings 6+ years of experience with large-scale compute infrastructure, strong programming skills in both scripting languages (Python, Bash) and compiled languages (Go, Rust, C/C++), and deep expertise in Linux systems and container technologies. Knowledge of AI/HPC job schedulers, performance tuning, and distributed systems is essential.

This role offers competitive compensation with a base salary ranging from $184,000 to $356,500 USD (depending on level), plus equity and comprehensive benefits. You'll be joining a diverse, supportive environment where innovation is celebrated and you can make a lasting impact on the world of AI and high-performance computing.

The position offers flexibility with locations in major tech hubs like Santa Clara, CA and Austin, TX, with hybrid work options available. You'll be working with state-of-the-art technology including NVIDIA GPUs, CUDA programming, and modern ML frameworks, while collaborating with talented researchers and engineers to push the boundaries of what's possible in AI and HPC.

Last updated 11 hours ago

Responsibilities For Senior AI-HPC Cluster Engineer - MLOps

Provide leadership and strategic mentorship on the management of large-scale HPC systems
Develop and improve GPU-accelerated computing ecosystem including scalable automation solutions
Build and maintain customer and cross-team relationships
Support researchers in running workloads including performance analysis and optimizations
Conduct root cause analysis and suggest corrective action
Build innovative tooling to accelerate researchers' velocity and software performance

Requirements For Senior AI-HPC Cluster Engineer - MLOps

Python · Kubernetes · Linux · Rust

Bachelor's degree in Computer Science, Electrical Engineering or related field
6+ years of experience with large-scale compute infrastructure
Experience with AI/HPC job schedulers (Slurm, K8s, LSF)
Proficient in Linux (CentOS/RHEL/Ubuntu)
Container technologies knowledge (Enroot, Docker, Podman)
Proficiency in Python/Bash and Go/Rust/C/C++
Experience analyzing and tuning AI/HPC workloads
Excellent problem-solving and communication skills

Benefits For Senior AI-HPC Cluster Engineer - MLOps

Equity

Medical Insurance



