Senior DevOps Engineer - GPU Clusters

World leader in accelerated computing, pioneering AI and digital twin technology to transform industries.
$180,000 - $339,250
DevOps
Senior Software Engineer
In-Person
7+ years of experience
AI · Enterprise SaaS

Description For Senior DevOps Engineer - GPU Clusters

NVIDIA, the pioneer in GPU technology and AI innovation, is seeking a Senior DevOps Engineer to lead its GPU cluster infrastructure. This role sits at the intersection of high-performance computing and artificial intelligence: you'll design and manage the large-scale GPU clusters that power cutting-edge AI workloads.

The position offers an opportunity to work with state-of-the-art technology in a company that's driving the future of AI and machine learning. You'll be managing infrastructure that supports multiple teams and projects, making a direct impact on NVIDIA's AI initiatives. The role requires expertise in cloud technologies, infrastructure automation, and high-performance computing environments.

As a Senior DevOps Engineer, you'll be responsible for ensuring the reliability and efficiency of GPU clusters, implementing infrastructure-as-code best practices, and maintaining high availability for critical systems. You'll work in a multi-cloud environment spanning AWS, GCP, Azure, and OCI, as well as on-premises infrastructure.

The ideal candidate has a strong background in software engineering, with specific experience in GPU cluster management or similar high-performance computing environments. You'll need proficiency in container orchestration and infrastructure automation, along with excellent problem-solving skills. The role offers competitive compensation between $180,000 and $339,250, plus equity.

This is an excellent opportunity for someone passionate about infrastructure automation and operational excellence, who wants to work at the forefront of AI technology. You'll be joining a diverse and experienced team, contributing to groundbreaking developments in artificial intelligence and high-performance computing at NVIDIA.


Responsibilities For Senior DevOps Engineer - GPU Clusters

  • Design, deploy, and support large-scale, distributed GPU clusters for AI and ML workloads
  • Improve infrastructure provisioning, management, and monitoring through automation
  • Ensure high uptime and quality of service (QoS) through operational excellence and monitoring
  • Support a multi-cloud environment (AWS, GCP, Azure, OCI) and on-premises infrastructure
  • Define and implement service level objectives (SLOs) and service level indicators (SLIs)
  • Write root cause analysis (RCA) reports for production incidents
  • Participate in on-call rotation
  • Drive evaluation and integration of new GPU technologies

Requirements For Senior DevOps Engineer - GPU Clusters

Python · Go · Kubernetes · Linux
  • BS degree in Computer Science or equivalent experience
  • 7+ years of software engineering experience
  • 3+ years managing GPU clusters or similar environments
  • Expertise in production-level cloud services
  • Proficiency with Kubernetes, Docker, or similar tools
  • Experience in Python, Go, or Ruby
  • Strong Linux and TCP/IP knowledge
  • Proficiency in CI/CD, GitOps, and Infrastructure as Code
  • Strong communication and documentation skills

Benefits For Senior DevOps Engineer - GPU Clusters

  • Equity
