Taro Logo

Senior Platform and EngOps Engineer - Cluster Operations

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.
$144,000 - $270,250
DevOps
Senior Software Engineer
In-Person
5,000+ Employees
5+ years of experience
AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in GPU technology and accelerated computing, is seeking a Senior Platform and EngOps Engineer to join their Cluster Operations team. This role sits at the intersection of high-performance computing and artificial intelligence, where you'll be responsible for managing and optimizing large GPU clusters connected via NVLink and InfiniBand technologies.

The position offers an opportunity to work with cutting-edge technology in AI and HPC, developing and maintaining software that facilitates GPU communication and drives groundbreaking solutions. You'll be part of a team that's directly contributing to NVIDIA's mission in advancing artificial intelligence and high-performance computing capabilities.

As a Senior Platform and EngOps Engineer, you'll be responsible for developing automated tools for cluster management, implementing modern DevOps practices, and ensuring optimal cluster performance. The role requires strong technical expertise in Linux, Python, and automation tools like Ansible, combined with the ability to troubleshoot complex systems and collaborate across multiple teams.

The compensation package is competitive, with a base salary range of $144,000 - $270,250 USD (depending on level), plus equity and comprehensive benefits. This is an excellent opportunity for someone passionate about infrastructure, automation, and working with state-of-the-art GPU technology in a company that's leading the AI revolution.

The ideal candidate will bring 5+ years of hands-on experience with cluster operations, strong automation skills, and a deep understanding of operating systems and networks. Additional experience with GPU-focused hardware, Slurm scheduling, and large-scale networking would be particularly valuable. Join NVIDIA to be part of a team that's shaping the future of AI and computing technology.

Last updated 6 hours ago

Responsibilities For Senior Platform and EngOps Engineer - Cluster Operations

  • Develop automated tools to deploy, provision, and maintain GPU clusters with NVLink and InfiniBand
  • Implement DevOps tools to automate software updates and maintenance tasks
  • Manage and troubleshoot daily cluster failures and issues
  • Manage cluster software and firmware updates rollout
  • Collaborate with Engineering and Product Teams across multiple time zones

Requirements For Senior Platform and EngOps Engineer - Cluster Operations

Python
Linux
Kubernetes
  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field
  • 5+ years experience in deploying and administrating clusters and infrastructure
  • Expertise in Ansible, Python and Shell Scripting
  • Deep understanding of operating systems, computer networks, and high-performance applications
  • Proven ability to work with cross-functional teams
  • Proficient with Linux fundamentals

Benefits For Senior Platform and EngOps Engineer - Cluster Operations

Equity
Medical Insurance
  • Equity
  • Medical Insurance