Site Reliability Engineering (SRE) at Google combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As a Senior SRE focusing on Cloud Incident Response, you'll be responsible for ensuring Google Cloud Platform's stability and reliability through critical incident support and continuous improvement. You'll work on building systems and tooling to improve visibility into Cloud state, detection of large-scale issues, and communications with stakeholders. The role requires expertise in distributed systems, incident management, and technical leadership.
The position is part of Google's Technical Infrastructure team, which builds and maintains the architecture behind Google's product portfolio. You'll be working with a team that takes pride in being the engineers' engineers, focusing on keeping networks running optimally and ensuring the best possible user experience.
The ideal candidate will bring strong experience in software development, distributed systems, and incident management, combined with excellent problem-solving and communication skills. You'll have the opportunity to work on the challenges of scale unique to Google Cloud while contributing to the development of processes and tools that enhance platform reliability.
This role offers the chance to work within Google's SRE culture of intellectual curiosity and problem-solving, in an environment that encourages collaboration and big thinking. You'll be part of an organization that brings together diverse perspectives and backgrounds and that promotes self-direction while providing support and mentorship for growth and learning.