Microsoft's Azure Data engineering team is seeking a Principal Site Reliability Engineer to join its databases team, focusing on Azure Cosmos DB, a globally distributed, massively scalable, multi-model cloud database service. This role combines technical expertise with a focus on service reliability to keep Microsoft's operational database systems running.
The position offers an opportunity to work with cutting-edge technology in a team that operates like a startup while being part of one of the world's largest tech companies. You'll be responsible for meeting SLAs of 99.99% availability and latency under 10 ms for critical systems used in healthcare, retail, telecommunications, and IoT applications.
As a Principal SRE, you'll focus on automating root cause analysis and issue mitigation, often addressing problems before they impact customers. The role requires a data-driven approach to solving service reliability problems: analyzing massive amounts of telemetry and implementing automated solutions to maintain service level objectives (SLOs).
The position offers competitive compensation (a base salary range of $139,900 to $274,800, higher in the San Francisco and New York City areas) and comprehensive benefits including healthcare, educational resources, savings plans, and parental leave. You'll be part of Microsoft's inclusive culture, which values diverse perspectives and collaborative problem-solving.
Key responsibilities include building automation solutions, collaborating with customers on supportability issues, implementing service telemetry, and providing operational insights to product teams. The ideal candidate will have 6+ years of technical engineering experience, strong coding skills, and extensive experience with large-scale cloud services.
This is an excellent opportunity for a seasoned SRE professional who wants to make a significant impact on one of Microsoft's fastest-growing Azure services while contributing to systems that serve millions of users worldwide.