Google's Site Reliability Engineering (SRE) organization is seeking a Staff Software Engineer to join its Colossus SRE team. This role combines software and systems engineering to build and maintain large-scale, distributed, fault-tolerant systems. As an SRE, you'll ensure that Google Cloud's services meet their reliability and uptime targets while continuously improving performance and capacity.
The position requires deep expertise in distributed systems. The work focuses on optimizing existing systems, building infrastructure, and automating operations. You'll tackle scale challenges specific to Google Cloud, applying your knowledge of coding, algorithms, and complex system design. The role spans both pre-deployment activities, such as consulting on system design, and post-deployment responsibilities, such as monitoring system health and implementing improvements.
This role is part of the Technical Infrastructure team, which is fundamental to Google's product portfolio: it develops and maintains Google's data centers and builds its next-generation platforms. The team takes pride in its engineering excellence and innovative approach to problem-solving.
This is an excellent opportunity for experienced engineers who are passionate about large-scale systems, have strong leadership capabilities, and want to work on technology that impacts billions of users. The role offers the chance to work in a culture that values intellectual curiosity and collaboration, with opportunities to tackle complex technical challenges while growing professionally.
The position comes with Google's comprehensive benefits package and the opportunity to work with some of the industry's brightest minds in a company known for its technical innovation and global impact. You'll be part of a team that promotes self-direction while providing the support and mentorship needed for continuous learning and growth.