Senior Site Reliability Engineer — GPU Infrastructure

Genmo

A research lab building open, state-of-the-art models for video generation towards unlocking the right brain of AGI.

San Francisco, CA, USA

Site Reliability

Senior Software Engineer

In-Person

3+ years of experience

Description For Senior Site Reliability Engineer — GPU Infrastructure

Genmo, a cutting-edge research lab focused on video generation AI, is seeking a Senior Site Reliability Engineer to join its team in San Francisco. This role sits at the intersection of infrastructure and artificial intelligence, focusing on managing and optimizing the GPU clusters that power frontier generative models.

The position requires a strong background in Kubernetes, infrastructure automation, and DevOps practices. You'll be responsible for designing and operating GPU clusters, implementing Infrastructure-as-Code solutions, and ensuring the reliability of critical AI infrastructure. The role combines hands-on technical work with strategic planning and leadership responsibilities.

Key technical areas include Kubernetes operations, GPU scheduling, infrastructure automation with tools like Terraform and Helm, and building robust observability systems. You'll work with high-performance computing infrastructure, including InfiniBand/RDMA networking, and be responsible for maintaining 24/7 system reliability.

The ideal candidate brings 3+ years of SRE/DevOps experience, with particular expertise in Kubernetes fleet management. Machine learning expertise is welcome but not required; Genmo is committed to helping the right candidate develop these skills. This is an opportunity to shape the future of AI infrastructure at a company working on state-of-the-art video generation technology.

Working at Genmo means joining a team dedicated to pushing the boundaries of what's possible in AI, with a focus on building open, cutting-edge models. The company offers a collaborative environment where you'll work directly with researchers and engineers, making a direct impact on the future of video generation technology and AGI development.

Last updated 2 days ago

Responsibilities For Senior Site Reliability Engineer — GPU Infrastructure

Own the design and day-to-day operation of GPU clusters for training and serving frontier generative models
Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation
Define and implement Infrastructure-as-Code and GitOps workflows
Build CI/CD pipelines, automated testing, and rollout strategies
Develop observability stack and GPU telemetry
Optimize high-performance networking and debug performance bottlenecks
Run and improve 24×7 on-call rotation; lead post-incident reviews
Partner with researchers and engineers, and communicate effectively across teams

Requirements For Senior Site Reliability Engineer — GPU Infrastructure

Skills: Python, Kubernetes

BS/MS/PhD in CS, EE, or related field
3+ years SRE/DevOps in production
2+ years managing large Kubernetes fleets
Expert-level Kubernetes experience
Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible)
Track record of shipping and operating large-scale infrastructure
