Site Reliability Engineer

Together AI is a research-driven artificial intelligence company focused on open and transparent AI systems. It aims to lower the cost of modern AI by co-designing software, hardware, algorithms, and models.
$160,000 - $230,000
Site Reliability
Senior Software Engineer
Hybrid
11 - 50 Employees
7+ years of experience
This job posting is no longer active. 😔

Job Description

As a Site Reliability Engineer (SRE) at Together AI, you will be responsible for maintaining all user-facing services and production systems. This role combines pragmatic operations with software engineering, applying sound engineering principles, operational discipline, and mature automation to our operating environments and codebase.

You will specialize in systems (operating systems, storage subsystems, networking) while implementing best practices for availability, reliability, and scalability. Your varied interests in algorithms and distributed systems will be valuable in this role.

Key responsibilities include:

  • Participating in an on-call (PagerDuty) rotation to respond to incidents impacting availability
  • Building and running infrastructure using Ansible, Terraform, and Kubernetes to enable scaling for a massive number of concurrent users
  • Developing monitoring systems to ensure the highest quality service for customers
  • Designing and implementing operational processes such as deployments and upgrades
  • Debugging production issues across all services and stack levels
  • Identifying improvements for product architecture from reliability, performance, and availability perspectives
  • Planning the growth of Together AI's infrastructure

Together AI is at the forefront of AI research and development, contributing to open-source research, models, and datasets. The company has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. This role offers an opportunity to join a passionate team of researchers and engineers in building the next generation of AI infrastructure.

The position offers competitive compensation, including a base salary range of $160,000 - $230,000, startup equity, health insurance, and other benefits. Together AI is an Equal Opportunity Employer, providing equal employment opportunities regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

If you're passionate about AI infrastructure and have the skills to keep complex systems running smoothly at scale, this role at Together AI could be an excellent opportunity to make a significant impact in the field of artificial intelligence.

Last updated a year ago

Responsibilities For Site Reliability Engineer

  • Be on an on-call (PagerDuty) rotation to respond to incidents that impact availability
  • Build and run infrastructure with Ansible, Terraform, and Kubernetes to enable scaling to a massive number of concurrent users
  • Build monitoring systems to ensure the highest quality service for customers
  • Design and implement operational processes (such as deployments and upgrades)
  • Debug production issues across all services and levels of the stack
  • Identify improvements to the product architecture from the reliability, performance, and availability perspectives
  • Plan the growth of Together AI's infrastructure

Requirements For Site Reliability Engineer

Kubernetes
Linux
  • 7+ years of professional SRE or related experience
  • Bachelor's degree in Computer Science or related field or equivalent work experience
  • Expert knowledge of Ansible (roles, playbooks), Terraform, and Kubernetes
  • Proficiency in programming/scripting languages
  • Direct experience in monitoring and observability practices
  • Advanced knowledge of cloud services
  • Ability to thrive in a collaborative environment involving different stakeholders and subject matter experts

Benefits For Site Reliability Engineer

  • Startup equity
  • Health insurance