Staff Software Engineer - Site Reliability and Observability

Teraswitch provides high-performance, low-latency bare metal servers worldwide, serving thousands of customers across 185 countries through 20 datacenter locations.
$175,000 - $250,000
Site Reliability
Staff Software Engineer
In-Person
7+ years of experience
Enterprise SaaS

Job Description

Teraswitch, a leading provider of high-performance bare metal servers, is seeking a Staff Software Engineer specializing in Site Reliability and Observability. With a presence in 20 datacenters worldwide and serving thousands of customers across 185 countries, Teraswitch stands as one of the largest privately-held infrastructure companies globally.

The role combines software engineering expertise with infrastructure management, focusing on ensuring system reliability, scalability, and performance. As a Staff SRE, you'll be responsible for implementing and maintaining a comprehensive observability platform, developing automation tools, and leading key technical initiatives across the organization.

Key responsibilities include system monitoring and troubleshooting, developing automation infrastructure, performance optimization, and incident response. You'll collaborate closely with development teams to incorporate reliability considerations into software design and implementation. The position requires participation in a 24/7 on-call rotation for critical systems.

The ideal candidate brings 7+ years of hands-on SRE experience with strong software development skills in languages like Java, Go, and Python. You should have extensive experience with monitoring tools (Grafana, Loki, Logstash), containerization (Docker, Kubernetes), and infrastructure as code (Terraform, Chef, Ansible).

Teraswitch offers a competitive compensation package ranging from $175K to $250K, along with comprehensive benefits including health, dental, and vision insurance, 401k with profit sharing, flexible PTO, and paid holidays. This is an opportunity to join a growing company and make a significant impact on the reliability and scalability of critical infrastructure systems.

Last updated a day ago

Responsibilities For Staff Software Engineer - Site Reliability and Observability

  • Implement a scalable, reliable, and secure SRE and observability platform
  • Monitor system performance and availability
  • Develop and maintain automation tools
  • Analyze and optimize system performance
  • Conduct incident response and root cause analysis
  • Collaborate with development teams
  • Deliver tools/software to improve reliability and scalability
  • Serve as technical leader for key initiatives
  • Participate in a 24/7 on-call rotation
  • Improve best practices through technical implementations

Requirements For Staff Software Engineer - Site Reliability and Observability

  • 7+ years of hands-on SRE experience, including software development experience (Java, Go, Python)
  • Experience building and operating high-availability, fault-tolerant, scalable distributed software
  • Experience with monitoring and logging tools (Grafana, Loki, Logstash, ClickHouse)
  • Experience with SDLC and deployment
  • Strong working knowledge of Docker, Kubernetes, Terraform, and Chef or Ansible
  • Experience troubleshooting production applications
  • BS/MS in Computer Science/Engineering preferred

Benefits For Staff Software Engineer - Site Reliability and Observability

  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • 401k with company profit sharing
  • Flex PTO
  • 11 company-paid holidays
