Principal Software Engineer - Reliability

Luma AI is a cutting-edge company specializing in AI technology and GPU infrastructure.
$200,000 - $250,000
Site Reliability
Principal Software Engineer
In-Person
10+ years of experience

Description For Principal Software Engineer - Reliability

Luma AI is seeking a Principal Software Engineer specializing in Reliability to join their Infrastructure and Research teams. This role is crucial for managing and optimizing Luma's extensive GPU clusters, which consist of thousands of H100 GPUs across multiple providers. The ideal candidate will be responsible for ensuring cluster health, building monitoring and management tools, and solving complex performance and maintenance problems.

Key responsibilities include:

  • Collaborating with researchers and engineers to define infrastructure requirements
  • Managing and scaling GPU clusters across multiple cloud providers
  • Designing scalable solutions to meet increasing demands
  • Implementing monitoring systems and fault-tolerant designs
  • Building automation tools and participating in on-call rotations
  • Developing and maintaining service level objectives (SLOs) and indicators (SLIs)

The ideal candidate will have:

  • 10+ years of experience as a reliability engineer, production engineer, or in a similar role
  • Strong proficiency in GPU cloud infrastructure and containerization technologies
  • Expertise in programming, IaC tools, and observability platforms
  • Excellent problem-solving and communication skills
  • Experience with AI/ML infrastructure (preferred)

Luma AI offers a competitive salary range of $200,000 - $250,000 per year, along with a significant equity grant. This is an exciting opportunity to work with cutting-edge technology and contribute to the growth of a rapidly scaling company in the AI space.

Last updated a year ago

Responsibilities For Principal Software Engineer - Reliability

  • Collaborate with researchers and engineers to specify infrastructure requirements
  • Work with multiple GPU cloud providers to manage and scale clusters
  • Design and implement scalable solutions for increasing demands
  • Implement and manage monitoring systems for proactive issue identification
  • Implement fault-tolerant and resilient design patterns
  • Build and maintain automation tools for system reliability
  • Participate in on-call rotation for 24/7 system availability
  • Develop and maintain service level objectives (SLOs) and indicators (SLIs)

Requirements For Principal Software Engineer - Reliability

Kubernetes, Linux, Python

  • 10+ years of experience as a reliability engineer, production engineer, or in a similar role
  • Strong proficiency in GPU cloud infrastructure
  • Proficiency in programming/scripting languages
  • Experience with containerization technologies and orchestration platforms
  • Knowledge of Infrastructure as Code (IaC) tools
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools
  • Knowledge of security best practices in cloud environments
  • Experience as an SRE within the AI/ML space (preferred)