Site Reliability Engineering (SRE) at Google Cloud combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. As a Staff Software Engineer in SRE, you'll ensure that Google Cloud's services have reliability and uptime appropriate to customers' needs, along with a fast rate of improvement. You'll work on optimizing existing systems, building infrastructure, and eliminating manual work through automation.
The role requires expertise in coding, algorithms, complexity analysis, and large-scale system design. You'll manage complex challenges of scale unique to Google Cloud while working in a culture that values diversity, intellectual curiosity, problem-solving, and openness.
Key responsibilities include engaging in the entire lifecycle of services, supporting services pre-launch, scaling systems sustainably, working on critical Google Cloud services, and solving operations problems using software engineering principles. You'll collaborate with developer teams on design, architecture, and processes.
You'll be part of the Technical Infrastructure team, which develops and maintains Google's data centers and builds the next generation of Google platforms. This team keeps Google's networks running smoothly so that users get the best and fastest experience possible.
Ideal candidates will have experience in computing, distributed systems, storage, or networking, with strong skills in designing, analyzing, and troubleshooting large-scale distributed systems. The ability to debug and optimize code and to automate routine tasks is essential, along with excellent problem-solving and communication skills.
Join Google's SRE team to work on meaningful projects, collaborate with colleagues who bring diverse perspectives, and contribute to the architecture that powers Google's vast product portfolio.