Software Engineer - Incident Management

Datadog

A global SaaS business delivering cloud monitoring, security, and analytics solutions that help organizations track their entire technology stack.

Boston, MA, USA • New York, NY, USA

$130,000 - $300,000

Site Reliability

Mid-Level Software Engineer

Hybrid

5,000+ Employees

3+ years of experience

Enterprise SaaS

Description For Software Engineer - Incident Management

Datadog, a leading global SaaS company, is seeking a Software Engineer to join its Incident Management SRE team. This role suits engineers who are passionate about building resilient systems and fostering a culture of continuous learning through incident response and analysis.

The position offers an opportunity to work at significant scale, processing trillions of data points daily while serving tens of thousands of companies. As part of the Incident Management SRE team, you'll play a crucial role in enhancing the company's incident response capabilities and on-call experience. The role combines technical expertise in Go, Python, and distributed systems with the soft skills needed to facilitate cross-team collaboration and learning.

Your responsibilities will span from developing software platforms supporting on-call rotations to leading post-mortem processes and training other engineers. The ideal candidate brings at least 3 years of software engineering experience, along with a strong background in incident response and distributed systems. You'll work in a hybrid environment that values both in-office collaboration and flexible work arrangements.

The compensation package is highly competitive, ranging from $130,000 to $300,000 USD, complemented by comprehensive benefits including equity grants, healthcare coverage, and professional development opportunities. Datadog's culture emphasizes pragmatism, honesty, and simplicity in solving complex problems, making it an ideal environment for engineers who want to make a significant impact while growing their careers.

This role offers the chance to work with cutting-edge technologies while helping shape how one of the fastest-growing observability platforms handles incidents and maintains reliability. If you're passionate about both technical excellence and teaching others, and want to be part of a company that values continuous learning and improvement, this position at Datadog could be your next career move.

Responsibilities For Software Engineer - Incident Management

Steer the on-call experience by establishing best practices and building platforms to support on-call rotations
Define incident response processes and write software to streamline them
Contribute to the post-mortem process and run a weekly post-mortem reading group
Support teams in facilitating incident reviews
Train on-callers in incident and post-mortem processes
Engage in cross-functional collaborations with different teams

Requirements For Software Engineer - Incident Management

At least 3 years of experience building software that solves real user problems
Experience with Go, Python, and TypeScript
Familiarity with Kubernetes and distributed systems
Experience being on-call and responding to incidents
Strong communication skills in English
Empathy and collaboration skills
Willingness to teach and train other engineers

Benefits For Software Engineer - Incident Management

401k

Dental Insurance

Education Budget

Equity

Medical Insurance

Mental Health Assistance

Vision Insurance

New hire stock equity (RSUs)
Employee stock purchase plan (ESPP)
Professional development and career pathing
Mentorship program
Mental health benefits
Healthcare
Dental benefits
401(k) plan and match
Paid time off
Fitness reimbursements
