Site Reliability Engineering II

Microsoft is a company that builds cloud platforms, software, and services, leading digital transformation in the age of cloud and AI.
$98,300 - $193,200
Site Reliability
Mid-Level Software Engineer
Remote
5,000+ Employees
4+ years of experience
Enterprise SaaS · Cloud

Description For Site Reliability Engineering II

Microsoft's Azure Data engineering team is seeking a Site Reliability Engineer II to join its databases team, focusing on operational database systems. This role is part of Azure Cosmos DB, Microsoft's globally distributed, massively scalable, multi-model cloud database service.

As an SRE II, you'll be responsible for maintaining and improving service reliability for one of Azure's fastest-growing services. The position involves working with critical systems in Healthcare, Retail, Telecommunications, and IoT, where service availability and latency are paramount. Azure Cosmos DB provides financially backed SLAs of 99.99% availability and <10ms latency.

Key responsibilities include:

  • Building and optimizing solutions for analyzing massive amounts of telemetry and service health indicators in near real-time
  • Performing automated root cause analysis and implementing necessary mitigations to restore SLOs
  • Collaborating with engineering teams on automation solutions
  • Working directly with enterprise customers to resolve service escalations
  • Contributing to the enhancement of customer-facing experiences through proactive monitoring and alerting

The role offers competitive compensation ($98,300 - $193,200 base salary range) and comprehensive benefits including healthcare, educational resources, and parental leave. This is a remote-friendly position with up to 100% work from home flexibility and 0-25% travel requirements.

The ideal candidate will bring 4+ years of technical experience in software engineering or systems administration, with specific expertise in SRE practices and cloud services. You'll join a diverse team that values different perspectives and operates with a startup mindset while having the resources and impact of a global technology leader.

This is an excellent opportunity for someone passionate about service reliability, automation, and working with cutting-edge cloud technology at scale. You'll be at the forefront of building and shaping the Livesite Automation and AI Ops stack in Cosmos DB, leading the path for broader adoption across Microsoft Azure.

Responsibilities For Site Reliability Engineering II

  • Collaborating with engineering teams on building and enhancing tooling and automation solutions
  • Working with customers to understand pain points around Supportability and SLO attainment
  • Implementing changes to service telemetry for automation consumption
  • Enhancing the customer-facing experience through proactive alerting
  • Analyzing data and providing operational insights to Design and Product teams
  • Interfacing with large enterprise customers to handle service escalations

Requirements For Site Reliability Engineering II

Python
  • 4+ years technical experience in software engineering, network engineering, or systems administration
  • 3+ years of SRE or SWE experience running large scale cloud services
  • 2+ years of operational experience improving service reliability, availability, and performance
  • Understanding of observability and MELT (metrics, events, logs, traces) implementation patterns for large-scale services
  • Experience with Logic Apps and authoring Jupyter Notebooks
  • Systematic problem-solving approach with effective communication skills
  • Ability to deal with ambiguity in a fast-paced environment

Benefits For Site Reliability Engineering II

Medical Insurance
Parental Leave
401k
Education Budget
  • Industry leading healthcare
  • Educational resources
  • Discounts on products and services
  • Savings and investments
  • Maternity and paternity leave
  • Generous time away
  • Giving programs
  • Opportunities to network and connect
