Staff Software Engineer, AI Reliability Engineering

Anthropic

Anthropic creates reliable, interpretable, and steerable AI systems, focusing on safe and beneficial AI development.

San Francisco, CA, USA • New York, NY, USA • Seattle, WA, USA

$320,000 - $485,000

Staff Software Engineer

Hybrid

501 - 1,000 Employees

8+ years of experience

This job posting is no longer active.

Job Description

Anthropic is seeking an experienced Staff Software Engineer to join its AI Reliability Engineering team. This role is crucial for ensuring the reliability and performance of Anthropic's AI systems, both internal and customer-facing. The position combines traditional site reliability engineering with the unique challenges of AI infrastructure.

The role involves developing and maintaining service level objectives for large language model systems, implementing comprehensive monitoring solutions, and building high-availability infrastructure capable of serving millions of customers. You'll be responsible for creating automated failover systems across multiple regions and cloud providers, leading incident response for critical AI services, and optimizing costs for large-scale AI infrastructure.

The ideal candidate brings extensive experience in distributed systems observability and monitoring at scale, with a deep understanding of AI infrastructure operations. You should be comfortable working with both traditional infrastructure metrics and AI-specific performance indicators. Experience with chaos engineering, resilience testing, and maintaining SLO/SLA frameworks is essential.

Anthropic offers a competitive compensation package ranging from $320,000 to $485,000 USD, along with benefits including optional equity donation matching, generous vacation and parental leave, and flexible working hours. The position is hybrid, requiring at least 25% of the time in one of its offices in San Francisco, New York City, or Seattle.

The company is committed to developing safe and beneficial AI systems, working as a cohesive team on large-scale research efforts. It values impact-focused work and views AI research as an empirical science. The collaborative environment includes frequent research discussions and emphasizes effective communication skills.

This is an opportunity to play a crucial role in ensuring the reliability and safety of cutting-edge AI systems while working with a team dedicated to beneficial AI development. The position offers the chance to work on unprecedented technical challenges while contributing to Anthropic's mission of creating reliable, interpretable, and steerable AI systems.

Last updated 4 months ago

Responsibilities For Staff Software Engineer, AI Reliability Engineering

Develop Service Level Objectives for large language model serving and training systems
Design and implement monitoring systems for availability, latency and other metrics
Design and implement high-availability language model serving infrastructure
Develop automated failover and recovery systems across multiple regions and cloud providers
Lead incident response for critical AI services
Build and maintain cost optimization systems for large-scale AI infrastructure

Requirements For Staff Software Engineer, AI Reliability Engineering

Kubernetes

Linux

Extensive experience with distributed systems observability and monitoring at scale
Understanding of AI infrastructure operations
Experience implementing and maintaining SLO/SLA frameworks
Comfort with both traditional infrastructure metrics and AI-specific performance indicators
Experience with chaos engineering and resilience testing
Ability to bridge the gap between ML engineers and infrastructure teams
Excellent communication skills
Bachelor's degree in related field or equivalent experience

Benefits For Staff Software Engineer, AI Reliability Engineering

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
Office space for collaboration
Visa sponsorship available