Senior Platform and EngOps Engineer - Cluster Operations

NVIDIA

NVIDIA is the world leader in accelerated computing, pioneering GPU technology and AI solutions.

Santa Clara, CA, USA

$144,000 - $270,250

DevOps

Senior Software Engineer

In-Person

5,000+ Employees

5+ years of experience

AI · Enterprise SaaS

Job Description

NVIDIA, the pioneer in GPU technology and accelerated computing, is seeking a Senior Platform and EngOps Engineer to join its Cluster Operations team. This role sits at the intersection of high-performance computing and artificial intelligence, where you'll be responsible for managing and optimizing large GPU clusters connected via NVLink and InfiniBand.

The position offers an opportunity to work with cutting-edge technology in AI and HPC, developing and maintaining software that facilitates GPU communication. You'll be part of a team that contributes directly to NVIDIA's mission of advancing artificial intelligence and high-performance computing.

As a Senior Platform and EngOps Engineer, you'll be responsible for developing automated tools for cluster management, implementing modern DevOps practices, and ensuring optimal cluster performance. The role requires strong technical expertise in Linux, Python, and automation tools like Ansible, combined with the ability to troubleshoot complex systems and collaborate across multiple teams.

The compensation package is competitive, with a base salary range of $144,000 - $270,250 USD (depending on level), plus equity and comprehensive benefits. This is an excellent opportunity for someone passionate about infrastructure, automation, and working with state-of-the-art GPU technology in a company that's leading the AI revolution.

The ideal candidate will bring 5+ years of hands-on experience with cluster operations, strong automation skills, and a deep understanding of operating systems and networks. Additional experience with GPU-focused hardware, Slurm scheduling, and large-scale networking would be particularly valuable. Join NVIDIA to be part of a team that's shaping the future of AI and computing technology.


Responsibilities For Senior Platform and EngOps Engineer - Cluster Operations

Develop automated tools to deploy, provision, and maintain GPU clusters with NVLink and InfiniBand
Implement DevOps tools to automate software updates and maintenance tasks
Manage and troubleshoot daily cluster failures and issues
Manage the rollout of cluster software and firmware updates
Collaborate with Engineering and Product Teams across multiple time zones

Requirements For Senior Platform and EngOps Engineer - Cluster Operations

Python

Linux

Kubernetes

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field
5+ years of experience deploying and administering clusters and infrastructure
Expertise in Ansible, Python, and shell scripting
Deep understanding of operating systems, computer networks, and high-performance applications
Proven ability to work with cross-functional teams
Proficient with Linux fundamentals

Benefits For Senior Platform and EngOps Engineer - Cluster Operations

Equity

Medical Insurance
