Site Reliability Engineer

Runloop

Runloop builds foundational infrastructure for AI development, providing engineers and data scientists with fast, secure, and reproducible code sandboxes.

San Francisco, CA, USA

Site Reliability

Senior Software Engineer

In-Person

5+ years of experience

AI · Enterprise SaaS

Job Description

Runloop is at the forefront of AI development infrastructure, providing essential tools for AI engineers and data scientists through its secure and efficient code sandbox platform. As a Site Reliability Engineer at Runloop, you'll play a crucial role in maintaining and enhancing the reliability, security, and performance of its core platform.

The position combines deep operational knowledge with software engineering expertise, making it perfect for those who enjoy working at the intersection of infrastructure and code. You'll be responsible for designing and maintaining production infrastructure on major cloud platforms, implementing robust monitoring systems, and ensuring high availability for all users.

The role offers significant technical challenges, working with cutting-edge technologies like Kubernetes, Docker, and modern observability tools. You'll be part of a small but impactful team, directly influencing the platform's architecture and reliability practices. The position requires both technical expertise and leadership skills, as you'll be mentoring other engineers and driving best practices across the organization.

Working at Runloop means being at the heart of the AI revolution in software development. The company offers competitive compensation, comprehensive benefits, and an in-person work environment in San Francisco with an optional work-from-home day each week. This is an excellent opportunity for an experienced SRE who wants to make a meaningful impact on the future of AI development tools while working with modern technologies and practices.

The ideal candidate will bring strong programming skills, deep infrastructure knowledge, and experience with distributed systems. You'll need to be comfortable with incident response, on-call duties, and cross-functional collaboration. If you're passionate about building reliable systems and want to be part of shaping the future of AI-driven software engineering, this role offers the perfect blend of challenge and opportunity.

Last updated 6 hours ago

Responsibilities For Site Reliability Engineer

Design and maintain production infrastructure on cloud platforms (AWS, GCP, Azure)
Monitor and respond to system alerts and incidents using Grafana and Prometheus
Collaborate with developers on scalable and reliable feature design
Troubleshoot infrastructure, networking, and sandbox environment issues
Participate in on-call rotation
Define and track SLIs/SLOs and manage error budgets
Automate deployments, scaling, provisioning, and recovery tasks
Lead incident response and conduct root-cause analysis
Plan for capacity growth and forecast system usage
Mentor engineers in building reliable distributed systems

Requirements For Site Reliability Engineer

Proven experience as an SRE, DevOps Engineer, or similar role
Strong programming skills in Python or Go
Deep expertise in Docker and Kubernetes
Experience with cloud infrastructure and tools like Terraform/Pulumi
Familiarity with monitoring tools like Prometheus, Grafana, or Datadog
Solid understanding of networking, security, and Linux administration
Experience designing, scaling, and maintaining distributed systems
Proficiency in implementing observability frameworks
Experience managing incidents and running on-call operations
Ability to mentor engineers and influence reliability practices

Benefits For Site Reliability Engineer

Competitive salary and equity
Comprehensive health, dental, and vision insurance for you and dependents
Opportunity to work on cutting-edge AI technology
Free lunch and snacks
Optional 1 day WFH per week
