Senior ML Platform Engineer

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and innovations in AI and digital twins.
Santa Clara, CA, USA; Westford, MA 01886, USA; Austin, TX, USA
$184,000 - $356,500
Machine Learning
Senior Software Engineer
In-Person
5,000+ Employees
7+ years of experience
AI

Job Description

NVIDIA, the pioneer of GPU technology and leader in accelerated computing, is seeking a Senior ML Platform Engineer to drive innovation in its AI infrastructure. This role sits at the intersection of machine learning and platform engineering, focusing on building and scaling high-performance ML infrastructure used across NVIDIA's AI research and product teams. The position offers the opportunity to work with cutting-edge GPU technology and some of the world's most powerful computing systems.

The role involves architecting and optimizing ML platforms that enable scientists and engineers to train, fine-tune, and deploy advanced ML models. You'll be responsible for developing internal tools for ML workflow orchestration, managing distributed GPU clusters, and ensuring high availability of critical AI infrastructure. The position requires expertise in both ML systems and platform engineering, with hands-on experience in technologies like Kubernetes, Docker, and modern ML frameworks.

NVIDIA offers competitive compensation with a base salary range of $184,000 - $356,500 USD (depending on level), plus equity and benefits. The company provides multiple location options including Santa Clara, Westford, Austin, and Durham, allowing for geographical flexibility while working on groundbreaking AI technology.

This is an excellent opportunity for experienced engineers passionate about ML infrastructure who want to impact the future of AI computing. You'll work with top talent in the field, have access to the latest GPU technology, and help shape the platforms that power NVIDIA's AI innovation. The role combines technical depth in ML systems with the challenge of building developer-friendly platforms at scale.

Last updated 2 months ago

Responsibilities For Senior ML Platform Engineer

  • Design, build, and maintain scalable ML platforms and infrastructure for training and inference on large-scale, distributed GPU clusters
  • Develop internal tools and automation for ML workflow orchestration, resource scheduling, data access, and reproducibility
  • Collaborate with ML researchers and applied scientists to optimize performance and streamline end-to-end experimentation
  • Evolve and operate multi-cloud and hybrid environments with focus on high availability
  • Define and monitor ML-specific infrastructure metrics
  • Build tooling to support experimentation tracking and model versioning
  • Participate in on-call support for platform services
  • Drive adoption of modern GPU technologies

Requirements For Senior ML Platform Engineer

Python
Go
Rust
Kubernetes
  • BS/MS in Computer Science, Engineering, or equivalent experience
  • 7+ years in software/platform engineering, including 3+ years in ML infrastructure
  • Solid understanding of ML training/inference workflows and lifecycle
  • Proficiency in containerized workloads with Kubernetes, Docker, and workload schedulers
  • Experience with ML orchestration tools (Kubeflow, Flyte, Airflow, or Ray)
  • Strong coding skills in Python, Go, or Rust
  • Experience running Slurm or custom scheduling frameworks
  • Familiarity with GPU computing, Linux systems internals, and performance tuning

Benefits For Senior ML Platform Engineer

  • Equity
  • Benefits package available but not detailed in posting