Software Engineer - Incident Management

A global SaaS business delivering cloud monitoring, security, and analytics solutions that help organizations track their entire technology stack.
$130,000 - $300,000
Site Reliability
Mid-Level Software Engineer
Hybrid
5,000+ Employees
3+ years of experience
Enterprise SaaS

Description For Software Engineer - Incident Management

Datadog, a leading global SaaS company, is seeking a Software Engineer to join its Incident Management SRE team. This role suits engineers passionate about building resilient systems and fostering a culture of continuous learning through incident response and analysis.

The position offers an opportunity to work at significant scale, processing trillions of data points daily while serving tens of thousands of companies. As part of the Incident Management SRE team, you'll play a crucial role in enhancing the company's incident response capabilities and on-call experience. The role combines technical expertise in Go, Python, and distributed systems with the soft skills needed to facilitate cross-team collaboration and learning.

Your responsibilities will span from developing software platforms supporting on-call rotations to leading post-mortem processes and training other engineers. The ideal candidate brings at least 3 years of software engineering experience, along with a strong background in incident response and distributed systems. You'll work in a hybrid environment that values both in-office collaboration and flexible work arrangements.

The compensation package is highly competitive, ranging from $130,000 to $300,000 USD, complemented by comprehensive benefits including equity grants, healthcare coverage, and professional development opportunities. Datadog's culture emphasizes pragmatism, honesty, and simplicity in solving complex problems, making it an ideal environment for engineers who want to make a significant impact while growing their careers.

This role offers the chance to work with cutting-edge technologies while helping shape how one of the fastest-growing observability platforms handles incidents and maintains reliability. If you're passionate about both technical excellence and teaching others, and want to be part of a company that values continuous learning and improvement, this position at Datadog could be your next career move.

Responsibilities For Software Engineer - Incident Management

  • Steer the on-call experience by establishing best practices and building platforms to support on-call rotations
  • Define incident response processes and write software to streamline them
  • Contribute to the post-mortem process and run a weekly post-mortem reading group
  • Support teams in facilitating incident reviews
  • Train on-callers in incident and post-mortem processes
  • Engage in cross-functional collaborations with different teams

Requirements For Software Engineer - Incident Management

  • At least 3 years of experience building software that solves real user problems
  • Experience with Go, Python, and TypeScript
  • Familiarity with Kubernetes and distributed systems
  • Experience being on-call and responding to incidents
  • Strong communication skills in English
  • Empathy and collaboration skills
  • Willingness to teach and train other engineers

Benefits For Software Engineer - Incident Management

  • New hire stock equity (RSUs)
  • Employee stock purchase plan (ESPP)
  • Professional development and career pathing
  • Education budget
  • Mentorship program
  • Mental health benefits
  • Healthcare (medical insurance)
  • Dental benefits
  • Vision insurance
  • 401(k) plan and match
  • Paid time off
  • Fitness reimbursements
