Site Reliability Engineer

Cover Genius

Cover Genius is a Series E insurtech that protects the global customers of the world's largest digital companies including Booking Holdings, Intuit, Uber, Hopper, Ryanair, and more.

Sydney NSW, Australia

Site Reliability

Senior Software Engineer

Hybrid

AI · Finance

Description For Site Reliability Engineer

Cover Genius is a Series E insurtech that protects global customers of major digital companies. As a Site Reliability Engineer, you'll ensure reliable operation of production systems, working across technical areas to automate and improve platforms. Key responsibilities include:

Analyzing, testing, and modifying systems for reliability and performance
Developing observability tools and dashboards
Implementing automation tools, CI/CD pipelines, and reducing toil
Troubleshooting production issues
Applying AWS and GCP knowledge to maintain cloud infrastructure
Collaborating with Software Engineers to improve tools and procedures
Developing documentation and runbooks
Optimizing computing infrastructure costs

Requirements:

Understanding of SRE principles and best practices
Experience with modern observability tools (ELK/EFK, Prometheus, Grafana)
Scripting skills (Bash, Python, Go)
Experience with infrastructure as code (Terraform, CloudFormation)
Container technology knowledge (Docker, Kubernetes)
Linux experience
Networking and system architecture understanding
AWS/GCP knowledge
Bachelor's degree in Computer Science/Engineering or equivalent experience
Strong communication and documentation skills
Self-motivated learner with attention to detail

Join a diverse team across 20+ countries, recognized as the #1 fastest-growing company in APAC by the Financial Times in 2020. Be part of an innovative company that values being bold, authentic, purposeful, and inspired.

Last updated 7 months ago

Responsibilities For Site Reliability Engineer

Analyze, test, and modify systems to improve reliability and optimize performance, particularly at the architectural and infrastructure level
Develop and maintain observability tooling and dashboards
Implement automation tools, frameworks, and CI/CD pipelines to reduce toil
Troubleshoot production issues and coordinate with the development team to streamline code deployments
Apply AWS and GCP knowledge and skills to create & maintain cloud infrastructure for software projects
Design, develop and implement software integrations
Collaborate with Software Engineers and other team members with the goal of improving engineering tools, systems, procedures and data security
Develop and maintain design and troubleshooting documentation and runbooks
Optimize and control costs of the company's computing infrastructure

Requirements For Site Reliability Engineer

Linux

Python

Kubernetes

Understanding of SRE principles and best practices
Experience using & configuring modern observability tools such as ELK/EFK, Prometheus, Grafana
Comfortable scripting & developing internal tooling with Bash and at least one programming language (e.g. Python, Go)
Experience working with infrastructure & configuration as code tools such as Terraform, CloudFormation, Chef, Puppet, etc.
Experienced with container technology such as Docker, and ideally experienced with using and managing Kubernetes clusters
Experience working with Linux
Solid understanding of networking and system architecture
Solid understanding of how to deploy, scale and monitor web applications and databases
Good knowledge of AWS and/or GCP platforms and associated best practices
Bachelor's degree in Computer Science/Engineering or equivalent practical experience
Strong communication and documentation skills
Curious and self-motivated learner
Professional approach
Good team member
Organisational and time management skills
Excellent attention to detail
Positive approach to change