Site Reliability Engineer

Runloop builds foundational infrastructure for AI development, providing engineers and data scientists with fast, secure, and reproducible code sandboxes.
Site Reliability
Senior Software Engineer
In-Person
5+ years of experience
AI · Enterprise SaaS

Job Description

Runloop is at the forefront of AI development infrastructure, providing essential tools for AI engineers and data scientists through its secure and efficient code sandbox platform. As a Site Reliability Engineer at Runloop, you'll play a crucial role in maintaining and enhancing the reliability, security, and performance of its core platform.

The position combines deep operational knowledge with software engineering expertise, making it perfect for those who enjoy working at the intersection of infrastructure and code. You'll be responsible for designing and maintaining production infrastructure on major cloud platforms, implementing robust monitoring systems, and ensuring high availability for all users.

The role offers significant technical challenges, working with cutting-edge technologies like Kubernetes, Docker, and modern observability tools. You'll be part of a small but impactful team, directly influencing the platform's architecture and reliability practices. The position requires both technical expertise and leadership skills, as you'll be mentoring other engineers and driving best practices across the organization.

Working at Runloop means being at the heart of the AI revolution in software development. The company offers competitive compensation, comprehensive benefits, and a hybrid work environment in San Francisco. This is an excellent opportunity for an experienced SRE who wants to make a meaningful impact on the future of AI development tools while working with modern technologies and practices.

The ideal candidate will bring strong programming skills, deep infrastructure knowledge, and experience with distributed systems. You'll need to be comfortable with incident response, on-call duties, and cross-functional collaboration. If you're passionate about building reliable systems and want to be part of shaping the future of AI-driven software engineering, this role offers the perfect blend of challenge and opportunity.

Last updated 6 hours ago

Responsibilities For Site Reliability Engineer

  • Design and maintain production infrastructure on cloud platforms (AWS, GCP, Azure)
  • Monitor and respond to system alerts and incidents using Grafana and Prometheus
  • Collaborate with developers on scalable and reliable feature design
  • Troubleshoot infrastructure, networking, and sandbox environment issues
  • Participate in on-call rotation
  • Define and track SLIs/SLOs and manage error budgets
  • Automate deployments, scaling, provisioning, and recovery tasks
  • Lead incident response and conduct root-cause analysis
  • Plan for capacity growth and forecast system usage
  • Mentor engineers in building reliable distributed systems

Requirements For Site Reliability Engineer

Python
Go
Kubernetes
Linux
  • Proven experience as an SRE, DevOps Engineer, or similar role
  • Strong programming skills in Python or Go
  • Deep expertise in Docker and Kubernetes
  • Experience with cloud infrastructure and tools like Terraform/Pulumi
  • Familiarity with monitoring tools like Prometheus, Grafana, or Datadog
  • Solid understanding of networking, security, and Linux administration
  • Experience designing, scaling, and maintaining distributed systems
  • Proficiency in implementing observability frameworks
  • Experience managing incidents and running on-call operations
  • Ability to mentor engineers and influence reliability practices

Benefits For Site Reliability Engineer

Medical Insurance
Dental Insurance
Vision Insurance
Equity
  • Competitive salary and equity
  • Comprehensive health, dental, and vision insurance for you and dependents
  • Opportunity to work on cutting-edge AI technology
  • Free lunch and snacks
  • Option to work from home (WFH) one day per week
