Principal Software Engineer - Reliability

Luma AI

Luma AI is a cutting-edge company specializing in AI technology and GPU infrastructure.

San Francisco Bay Area, CA, USA

$200,000 - $250,000

Site Reliability

Principal Software Engineer

In-Person

10+ years of experience


Description For Principal Software Engineer - Reliability

Luma AI is seeking a Principal Software Engineer specializing in Reliability to join its Infrastructure and Research teams. The role is central to managing and optimizing Luma's GPU clusters, which comprise thousands of H100 GPUs across multiple providers. The engineer in this role will be responsible for ensuring cluster health, building monitoring and management tools, and solving complex performance and maintenance problems.

Key responsibilities include:

Collaborating with researchers and engineers to define infrastructure requirements
Managing and scaling GPU clusters across multiple cloud providers
Designing scalable solutions to meet increasing demands
Implementing monitoring systems and fault-tolerant designs
Building automation tools and participating in on-call rotations
Developing and maintaining service level objectives (SLOs) and indicators (SLIs)

The ideal candidate will have:

10+ years of experience as a reliability engineer, production engineer, or similar role
Strong proficiency in GPU cloud infrastructure and containerization technologies
Expertise in programming, IaC tools, and observability platforms
Excellent problem-solving and communication skills
Experience with AI/ML infrastructure (preferred)

Luma AI offers a competitive salary range of $200,000 - $250,000 per year, along with a significant equity grant. This is an exciting opportunity to work with cutting-edge technology and contribute to the growth of a rapidly scaling company in the AI space.

Last updated a year ago

Responsibilities For Principal Software Engineer - Reliability

Collaborate with researchers and engineers to specify infrastructure requirements
Work with multiple GPU cloud providers to manage and scale clusters
Design and implement scalable solutions for increasing demands
Implement and manage monitoring systems for proactive issue identification
Implement fault-tolerant and resilient design patterns
Build and maintain automation tools for system reliability
Participate in on-call rotation for 24/7 system availability
Develop and maintain service level objectives (SLOs) and indicators (SLIs)

Requirements For Principal Software Engineer - Reliability

Kubernetes

Linux

Python

10+ years of experience as a reliability engineer, production engineer, or similar role
Strong proficiency in GPU cloud infrastructure
Proficiency in programming/scripting languages
Experience with containerization technologies and orchestration platforms
Knowledge of Infrastructure as Code (IaC) tools
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Experience with observability tools
Knowledge of security best practices in cloud environments
Experience as an SRE within the AI/ML space (preferred)