Teraswitch, a leading provider of high-performance bare metal servers, is seeking a Staff Software Engineer specializing in Site Reliability and Observability. With a presence in 20 datacenters worldwide and thousands of customers across 185 countries, Teraswitch is one of the largest privately held infrastructure companies in the world.
The role combines software engineering expertise with infrastructure management, focusing on ensuring system reliability, scalability, and performance. As a Staff SRE, you'll be responsible for implementing and maintaining a comprehensive observability platform, developing automation tools, and leading key technical initiatives across the organization.
Key responsibilities include system monitoring and troubleshooting, developing automation infrastructure, performance optimization, and incident response. You'll collaborate closely with development teams to incorporate reliability considerations into software design and implementation. The position requires participation in a 24/7 on-call rotation for critical systems.
The ideal candidate brings 7+ years of hands-on SRE experience and strong software development skills in languages such as Java, Go, and Python. You should have extensive experience with monitoring and logging tools (Grafana, Loki, Logstash), containerization (Docker, Kubernetes), and infrastructure as code (Terraform, Chef, Ansible).
Teraswitch offers a competitive compensation package ranging from $175K to $250K, along with comprehensive benefits including health, dental, and vision insurance, a 401(k) with profit sharing, flexible PTO, and paid holidays. This is an opportunity to join a growing company and make a significant impact on the reliability and scalability of critical infrastructure systems.