Runloop is at the forefront of AI development infrastructure, providing essential tools for AI engineers and data scientists through its secure and efficient code sandbox platform. As a Site Reliability Engineer at Runloop, you'll play a crucial role in maintaining and enhancing the reliability, security, and performance of its core platform.
The position combines deep operational knowledge with software engineering expertise, making it perfect for those who enjoy working at the intersection of infrastructure and code. You'll be responsible for designing and maintaining production infrastructure on major cloud platforms, implementing robust monitoring systems, and ensuring high availability for all users.
The role offers significant technical challenges and involves working with cutting-edge technologies like Kubernetes, Docker, and modern observability tools. You'll be part of a small but impactful team, directly influencing the platform's architecture and reliability practices. The position requires both technical expertise and leadership skills, as you'll be mentoring other engineers and driving best practices across the organization.
Working at Runloop means being at the heart of the AI revolution in software development. The company offers competitive compensation, comprehensive benefits, and a hybrid work environment in San Francisco. This is an excellent opportunity for an experienced SRE who wants to make a meaningful impact on the future of AI development tools while working with modern technologies and practices.
The ideal candidate will bring strong programming skills, deep infrastructure knowledge, and experience with distributed systems. You'll need to be comfortable with incident response, on-call duties, and cross-functional collaboration. If you're passionate about building reliable systems and want to be part of shaping the future of AI-driven software engineering, this role offers the perfect blend of challenge and opportunity.