Genmo, a cutting-edge research lab focused on video generation AI, is seeking a Senior Site Reliability Engineer to join its team in San Francisco. This role sits at the intersection of infrastructure and artificial intelligence, focusing on managing and optimizing the GPU clusters that power frontier generative models.
The position requires a strong background in Kubernetes, infrastructure automation, and DevOps practices. You'll be responsible for designing and operating GPU clusters, implementing Infrastructure-as-Code solutions, and ensuring the reliability of critical AI infrastructure. The role combines hands-on technical work with strategic planning and leadership responsibilities.
Key technical areas include Kubernetes operations, GPU scheduling, infrastructure automation with tools like Terraform and Helm, and building robust observability systems. You'll work with high-performance computing infrastructure, including InfiniBand/RDMA networking, and be responsible for maintaining 24/7 system reliability.
The ideal candidate brings 3+ years of SRE/DevOps experience, with particular expertise in Kubernetes fleet management. Machine learning expertise is welcome but not required; Genmo is committed to helping the right candidate develop these skills. This is an opportunity to shape the future of AI infrastructure at a company working on state-of-the-art video generation technology.
Working at Genmo means joining a team dedicated to pushing the boundaries of what's possible in AI, with a focus on building open, cutting-edge models. The company offers a collaborative environment where you'll work directly with researchers and engineers and have a tangible impact on the future of video generation technology and AGI development.