Principal Site Reliability Developer

A world leader in cloud solutions using tomorrow's technology to tackle today's challenges, partnering with industry leaders for over 40 years.
Site Reliability
Principal Software Engineer
In-Person
5,000+ Employees
5+ years of experience
Enterprise SaaS · Cloud

Job Description

As a senior member of the Site Reliability Engineering (SRE) team at Oracle, you'll play a crucial role in maintaining and improving our cloud infrastructure. This position combines deep technical expertise with leadership responsibilities, requiring both hands-on engineering skills and the ability to guide teams toward operational excellence.

You'll be responsible for designing and implementing high-availability architectures for large-scale distributed systems, while serving as the ultimate escalation point for complex operational issues. The role demands expertise in automation, monitoring, and system optimization, with a focus on maintaining robust SLAs and SLOs.

Oracle offers a compelling environment for SRE professionals, with access to cutting-edge cloud technology and the opportunity to work on systems that power thousands of enterprises worldwide. You'll collaborate with talented engineers across teams, mentor junior staff, and drive technical decision-making that impacts our global infrastructure.

The ideal candidate brings 3 to 5+ years of experience with strong Linux administration skills, Python programming expertise, and deep knowledge of distributed systems. You'll need to be comfortable both writing production-grade software and managing complex infrastructure, while maintaining a focus on automation and operational efficiency.

This role offers competitive benefits, including medical, life insurance, and retirement options, along with opportunities for professional growth and development. Join Oracle's SRE team to tackle challenging technical problems while building and maintaining systems that operate at massive scale.

Last updated a month ago

Responsibilities For Principal Site Reliability Developer

  • Lead the design, automation, and support of OCI services with a focus on resiliency, security, scalability, and performance
  • Own and improve the end-to-end reliability metrics (SLOs, SLAs, KPIs) for services
  • Design and implement high-availability architectures for large-scale distributed systems
  • Serve as the ultimate escalation point for complex operational issues
  • Architect and build automation and orchestration tools
  • Collaborate with development teams to improve service designs
  • Guide technical decision-making and mentor junior SREs
  • Participate in and lead postmortems and root cause analysis
  • Contribute to capacity planning and demand forecasting
  • Participate in a rotational on-call schedule

Requirements For Principal Site Reliability Developer

Python · Linux · Kubernetes
  • Advanced experience with Linux systems administration
  • Strong programming skills in Python
  • Advanced Bash/Shell scripting
  • Deep understanding of distributed systems, networking, and service architecture
  • Solid knowledge of databases
  • Strong understanding of CI/CD pipelines and DevOps best practices
  • Experience writing and maintaining unit tests
  • Proven ability to lead cross-functional efforts
  • 3 to 5+ years of experience
  • English language proficiency

Benefits For Principal Site Reliability Developer

Medical Insurance · Vision Insurance · Dental Insurance
  • Competitive benefits package
  • Medical insurance
  • Life insurance
  • Retirement options
  • Volunteer programs
  • Work-life balance