Senior Software Engineer - Incident Management

Global SaaS business delivering a cloud monitoring, security, and analytics platform that enables digital transformation and cloud migration.
Site Reliability
Senior Software Engineer
Hybrid
5,000+ Employees
3+ years of experience
Enterprise SaaS

Description For Senior Software Engineer - Incident Management

Datadog, a leading global SaaS company, is seeking a Senior Software Engineer to join its Incident Management SRE team. This role focuses on fostering a resilient culture by treating incidents as learning opportunities and catalysts for growth. The position involves close collaboration with teams across departments to improve the on-call experience, incident response, and post-incident analysis.

As a Senior Software Engineer in Incident Management, you'll be responsible for building and improving platforms that support on-call rotations, streamlining incident response processes, and facilitating post-mortem analyses. You'll work with Go, Python, and TypeScript in a distributed systems environment, helping teams navigate complex technical challenges while maintaining system reliability.

The ideal candidate brings at least 3 years of software engineering experience, strong knowledge of Kubernetes and distributed systems, and a track record of on-call incident response. You should be passionate about teaching others and driving organizational improvements through influence and collaboration.

Datadog offers a hybrid work environment, competitive benefits including equity compensation (RSUs and ESPP), and a strong focus on professional development. The company maintains an inclusive culture with various employee resource groups and emphasizes continuous learning and growth. This role provides an opportunity to make a significant impact on how a major tech company handles incidents and maintains system reliability while working with cutting-edge technologies.

Last updated 2 days ago

Responsibilities For Senior Software Engineer - Incident Management

  • Steer the on-call experience by establishing best practices and building platforms to support on-call rotations
  • Define incident response processes and write software to streamline them
  • Contribute to the post-mortem process and run the weekly post-mortem reading group
  • Support teams in facilitating incident reviews
  • Train on-callers in incident and post-mortem processes
  • Collaborate cross-functionally with other teams

Requirements For Senior Software Engineer - Incident Management

Go
Python
TypeScript
Kubernetes
  • At least 3 years of experience building software that solves real user problems
  • Familiarity with Kubernetes and distributed systems
  • Experience being on-call and responding to incidents
  • Strong communication skills in English
  • Experience with Go, Python, and TypeScript
  • Empathy and collaboration skills
  • Willingness to teach and train other engineers

Benefits For Senior Software Engineer - Incident Management

Equity
Mental Health Assistance
  • New hire stock equity (RSUs) and employee stock purchase plan (ESPP)
  • Continuous professional development and career pathing
  • Intradepartmental mentor and buddy program
  • Inclusive company culture with Community Guilds
  • Access to Inclusion Talks
  • Free global mental health benefits for employees and dependents
  • Competitive global benefits
