Principal Site Reliability Engineer

Microsoft builds cloud platforms, software, and services, and is transforming data analytics with products such as Microsoft Fabric, Azure SQL DB, and Azure Cosmos DB.
$137,600 - $267,000
Site Reliability
Principal Software Engineer
Remote
5,000+ Employees
8+ years of experience
Enterprise SaaS · Cloud

Description For Principal Site Reliability Engineer

Microsoft's Azure Data engineering team is seeking a Principal Site Reliability Engineer to join its databases team, focusing on Azure Cosmos DB, a globally distributed, massively scalable, multi-model cloud database service. This role combines software engineering excellence with operational expertise to ensure high availability and performance of critical cloud infrastructure.

The position offers an opportunity to work with cutting-edge technology while maintaining stringent Service Level Objectives (SLOs) for one of Azure's fastest-growing services. You'll be responsible for building and optimizing solutions that analyze massive amounts of telemetry and service health indicators in near real-time, performing automated root cause analysis, and implementing necessary mitigations to maintain service reliability.

As a Principal SRE, you'll work at the intersection of development and operations, focusing on making on-call engineering more efficient through automation and proactive problem-solving. The role involves collaboration with both engineering teams and enterprise customers, requiring strong technical communication skills and a data-driven approach to problem-solving.

The position offers competitive compensation ($137,600 - $267,000 base salary range), comprehensive benefits, and the opportunity to work remotely. You'll be part of a team that values innovation, inclusion, and a growth mindset while building systems that power some of the largest companies in the healthcare, retail, telecommunications, and IoT sectors.

This role is perfect for someone who combines deep technical expertise with a passion for service reliability, automated problem-solving, and customer success. You'll have the chance to influence product architecture and roadmap while ensuring supportability remains a key consideration in product evolution.

Join Microsoft's Azure Data team to help build the data platform for the age of AI, working with a talented team that operates with the agility of a startup while backed by the resources and stability of a global technology leader.

Last updated 5 days ago

Responsibilities For Principal Site Reliability Engineer

  • Collaborate with engineering teams to build and enhance tooling and automation solutions
  • Work with customers to understand pain points around supportability and SLO attainment
  • Communicate at a technical level and interface with enterprise customers on service escalations
  • Implement changes to service telemetry so that automation can consume it
  • Enhance the customer-facing experience through proactive alerting
  • Analyze data and provide operational insights to Design and Product teams

Requirements For Principal Site Reliability Engineer

Python
Java
  • 8+ years of technical experience in software engineering, network engineering, or systems administration
  • 3+ years of operational experience in improving Service Reliability, Availability and Performance
  • Understanding of observability and MELT (metrics, events, logs, traces) implementation patterns for large-scale services
  • Experience with Logic Apps and authoring Jupyter Notebooks
  • Expertise in analyzing, troubleshooting, and automating root cause analysis
  • Systematic problem-solving approach with effective communication skills
  • 5+ years of hands-on experience in Python/Java/C#
  • Must pass Microsoft Cloud background check

Benefits For Principal Site Reliability Engineer

Medical Insurance
Parental Leave
401k
Education Budget
  • Industry-leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect
