Anyscale, backed by prominent investors with more than $250 million in funding, develops Ray, an open-source framework for distributed computing. The company is building a platform that lets developers and data scientists scale ML applications from a laptop to a cluster without deep distributed systems expertise.
As a Site Reliability Engineer at Anyscale, you'll be instrumental in maintaining the reliability and performance of their production systems and user-facing services. The role combines engineering excellence with operational expertise, focusing on building robust systems for monitoring, observability, and deployment automation.
Key responsibilities include developing a comprehensive view of cloud component utilization, implementing effective deployment methodologies, and building sophisticated monitoring and alerting systems. You'll also be responsible for establishing testing infrastructure and defining organization-wide SLOs.
The position offers compensation ranging from $180.6K to $200.9K, complemented by equity and benefits including healthcare, a 401(k), and various stipends. The role is hybrid, based in either San Francisco or Palo Alto, combining flexibility with in-person collaboration.
This role is perfect for experienced SREs who want to work at the intersection of distributed systems and ML infrastructure, helping shape the future of AI application deployment. You'll be joining a company that powers the ML infrastructure of major tech companies like OpenAI, Uber, and Spotify, making a significant impact on the AI ecosystem.
The ideal candidate has at least 3 years of relevant experience and a passion for building reliable, scalable systems. Anyscale values diversity and inclusion, welcomes applications from all backgrounds, and provides equal opportunities for growth and success.