Senior ML Platform Engineer

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and innovations in AI and digital twins.

Santa Clara, CA, USA • Westford, MA 01886, USA • Austin, TX, USA…

$184,000 - $356,500

Machine Learning

Senior Software Engineer

In-Person

5,000+ Employees

7+ years of experience

Job Description

NVIDIA, the pioneer of GPU technology and leader in accelerated computing, is seeking a Senior ML Platform Engineer to drive innovation in its AI infrastructure. This role sits at the intersection of machine learning and platform engineering, focusing on building and scaling high-performance ML infrastructure used across NVIDIA's AI research and product teams. The position offers the opportunity to work with cutting-edge GPU technology and some of the world's most powerful computing systems.

The role involves architecting and optimizing ML platforms that enable scientists and engineers to train, fine-tune, and deploy advanced ML models. You'll be responsible for developing internal tools for ML workflow orchestration, managing distributed GPU clusters, and ensuring high availability of critical AI infrastructure. The position requires expertise in both ML systems and platform engineering, with hands-on experience in technologies like Kubernetes, Docker, and modern ML frameworks.

NVIDIA offers competitive compensation with a base salary range of $184,000 - $356,500 USD (depending on level), plus equity and benefits. The company provides multiple location options including Santa Clara, Westford, Austin, and Durham, allowing for geographical flexibility while working on groundbreaking AI technology.

This is an excellent opportunity for experienced engineers passionate about ML infrastructure who want to impact the future of AI computing. You'll work with top talent in the field, have access to the latest GPU technology, and help shape the platforms that power NVIDIA's AI innovation. The role combines technical depth in ML systems with the challenge of building developer-friendly platforms at scale.

Last updated 2 months ago

Responsibilities For Senior ML Platform Engineer

Design, build, and maintain scalable ML platforms and infrastructure for training and inference on large-scale, distributed GPU clusters
Develop internal tools and automation for ML workflow orchestration, resource scheduling, data access, and reproducibility
Collaborate with ML researchers and applied scientists to optimize performance and streamline end-to-end experimentation
Evolve and operate multi-cloud and hybrid environments with a focus on high availability
Define and monitor ML-specific infrastructure metrics
Build tooling to support experimentation tracking and model versioning
Participate in on-call support for platform services
Drive adoption of modern GPU technologies

Requirements For Senior ML Platform Engineer

Python

Rust

Kubernetes

BS/MS in Computer Science, Engineering, or equivalent experience
7+ years in software/platform engineering, including 3+ years in ML infrastructure
Solid understanding of ML training/inference workflows and lifecycle
Proficiency in containerized workloads with Kubernetes, Docker, and workload schedulers
Experience with ML orchestration tools (Kubeflow, Flyte, Airflow, or Ray)
Strong coding skills in Python, Go, or Rust
Experience running Slurm or custom scheduling frameworks
Familiarity with GPU computing, Linux systems internals, and performance tuning

Benefits For Senior ML Platform Engineer

Equity
Benefits package available, but not detailed in the posting