Senior Site Reliability Engineer — GPU Infrastructure

A research lab building open, state-of-the-art models for video generation towards unlocking the right brain of AGI.
Site Reliability
Senior Software Engineer
In-Person
3+ years of experience
AI

Description For Senior Site Reliability Engineer — GPU Infrastructure

Genmo, a cutting-edge research lab focused on video generation AI, is seeking a Senior Site Reliability Engineer to join its team in San Francisco. This role sits at the intersection of infrastructure and artificial intelligence, focusing on managing and optimizing the GPU clusters that power frontier generative models.

The position requires a strong background in Kubernetes, infrastructure automation, and DevOps practices. You'll be responsible for designing and operating GPU clusters, implementing Infrastructure-as-Code solutions, and ensuring the reliability of critical AI infrastructure. The role combines hands-on technical work with strategic planning and leadership responsibilities.

Key technical areas include Kubernetes operations, GPU scheduling, infrastructure automation with tools like Terraform and Helm, and building robust observability systems. You'll work with high-performance computing infrastructure, including InfiniBand/RDMA networking, and be responsible for maintaining 24/7 system reliability.

The ideal candidate brings 3+ years of SRE/DevOps experience, with particular expertise in Kubernetes fleet management. Machine learning expertise is welcome but not required; Genmo is committed to helping the right candidate develop these skills. This is an opportunity to shape the future of AI infrastructure at a company working on state-of-the-art video generation technology.

Working at Genmo means joining a team dedicated to pushing the boundaries of what's possible in AI, with a focus on building open, cutting-edge models. The company offers a collaborative environment where you'll work directly with researchers and engineers, making a direct impact on the future of video generation technology and AGI development.

Last updated 2 days ago

Responsibilities For Senior Site Reliability Engineer — GPU Infrastructure

  • Own the design and day-to-day operation of GPU clusters for training and serving frontier generative models
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi-cluster federation
  • Define and implement Infrastructure-as-Code and GitOps workflows
  • Build CI/CD pipelines, automated testing, and rollout strategies
  • Develop observability stack and GPU telemetry
  • Optimize high-performance networking and debug performance bottlenecks
  • Run and improve 24×7 on-call rotation; lead post-incident reviews
  • Partner with researchers and engineers; communicate effectively

Requirements For Senior Site Reliability Engineer — GPU Infrastructure

Python
Kubernetes
  • BS/MS/PhD in CS, EE, or related field
  • 3+ years SRE/DevOps in production
  • 2+ years managing large Kubernetes fleets
  • Expert-level Kubernetes experience
  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible)
  • Track record of shipping and operating large-scale infrastructure
