Machine Learning, Platform Engineer

A research-driven artificial intelligence company focused on lowering the cost of modern AI systems through co-designing software, hardware, algorithms, and models.
$160,000 - $250,000
Machine Learning
Senior Software Engineer
Hybrid
5+ years of experience
AI

Job Description

Together AI is seeking a Machine Learning Platform Engineer to join its team in San Francisco. This role is crucial in enabling custom models and dedicated inference on Together's platform, with a focus on optimizing autoscaling, minimizing cold starts, and achieving optimal model performance.

The position requires a seasoned professional with 5+ years of experience in building large-scale distributed systems. You'll be working at the intersection of machine learning infrastructure and platform engineering, utilizing technologies like Kubernetes, Terraform, and various programming languages including Go, Rust, and Python.

As a Platform Engineer, you'll be responsible for critical infrastructure components including multi-cluster orchestration, predictive autoscaling, and API development. The role offers an opportunity to work on cutting-edge AI infrastructure, contributing to Together AI's mission of making AI systems more accessible and cost-effective.

The company has made significant contributions to open-source research, including FlashAttention, Hyena, FlexGen, and RedPajama. It offers a compensation package ranging from $160,000 to $250,000, plus equity and comprehensive benefits.

This is an excellent opportunity for experienced engineers who are passionate about AI infrastructure and want to make a meaningful impact in the field of artificial intelligence. The hybrid work environment requires four days per week in the SF office, providing a balance between collaborative in-person work and flexibility.

The ideal candidate will have strong expertise in distributed systems, an excellent understanding of operating systems concepts, and proven experience with containerization and infrastructure as code. You'll be joining a research-driven company that values open and transparent AI systems, working alongside passionate researchers and engineers to advance the frontier of AI technology.

Last updated 3 days ago

Responsibilities For Machine Learning, Platform Engineer

  • Work on multi-cluster orchestration, portfolio optimization, predictive autoscaling, control planes
  • Model bring-up, light model optimization, APIs for managing deployments
  • Analyze and improve robustness and scalability of existing distributed systems
  • Partner with product teams to understand functional requirements
  • Write clear, well-tested, and maintainable software and IaC
  • Conduct design and code reviews, create documentation, and develop testing strategies

Requirements For Machine Learning, Platform Engineer

Python
Go
Rust
Kubernetes
  • 5+ years of demonstrated experience in building large-scale, fault-tolerant distributed systems and API microservices
  • Experience running serverless inference platforms, doing model bring-up on short notice, and being on call
  • Experience designing, analyzing and improving efficiency, scalability, and stability of system resources
  • Excellent understanding of low-level operating systems concepts
  • Expert-level programmer in Golang, Rust, Python, C++, or Haskell
  • Proficiency in Infrastructure as Code (IaC) using tools like Terraform
  • Experience with Kubernetes or other container orchestration systems
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or related field

Benefits For Machine Learning, Platform Engineer

  • Competitive compensation
  • Startup equity
  • Health insurance
  • Competitive benefits
