Site Reliability Engineering (SRE) at Google is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As an SRE, you'll ensure that Google's services have appropriate reliability and uptime while maintaining performance and capacity. The role involves applying creative engineering solutions to operations problems, with a focus on optimizing existing systems, building infrastructure, and automating operations work.
You'll be part of the Technical Infrastructure team, responsible for the architecture that powers Google's entire product portfolio. The role requires expertise in distributed systems, software development, and operational excellence. You'll work on designing, analyzing, and troubleshooting large-scale systems while providing technical leadership on projects.
The position offers competitive compensation ($166,000-$244,000 base salary plus bonus, equity, and benefits) and the opportunity to work with cutting-edge technology at massive scale. You'll join a culture that values diversity, intellectual curiosity, and problem-solving in a blame-free environment. The role provides both the autonomy to work on meaningful projects and the support and mentorship needed to grow professionally.
Key aspects of the role include system design consulting, capacity planning, launch reviews, monitoring system health, automation development, and incident response. You'll be responsible for the full service lifecycle, from inception and design through deployment and refinement. This is an opportunity to have a significant impact on the reliability and efficiency of Google's global infrastructure.