Caused a SEV - Publish RCA?

Mid-Level Data Engineer at Taro Community
10 days ago

In my 3rd week at Big Tech, I was over-eager to get my first PR merged (into a legacy Airflow repo) so I could complete my first ticket and show some progress. I made probably the most classic rookie mistake of not properly testing my code in staging, and my code ended up causing a SEV that took down prod for about 30 minutes. Because this was a legacy Airflow repo, it wasn't the end of the world: only internal workers (a small subset of the company) were affected, and it happened at night. Still, this was a pretty bad look for me with my manager, and I've been working hard since to make a better impression.

At my company, every SEV is supposed to get a Root Cause Analysis (RCA) doc where you describe the issue, the 5 Whys for why it happened, the timeline of how it happened, who it affected, and a few other details. There's technically a 2-week SLA for each RCA, but looking at other RCA docs, I see a lot of them were never actually filled out.

From my perspective, the reason for the SEV was simple: I didn't adequately test in staging. The oncall guy who helped me navigate the issue encouraged me not to personalize it so much and to think in terms of the process instead, e.g. testing in staging should have been required, or canary testing in prod should have caught my code and rolled it back.
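
To make the canary idea concrete, here is a minimal sketch of the kind of automated check my oncall mentor was describing. It is not how my company's deploy system works: the `HealthSample` type, the `should_roll_back` function, and both thresholds are hypothetical names and values made up for illustration. The idea is just to compare the failure rate of the canary against the existing baseline and roll back if the canary is clearly worse.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Run counts observed over one window (hypothetical shape)."""
    total_runs: int
    failed_runs: int

    @property
    def failure_rate(self) -> float:
        # Treat an empty window as healthy rather than dividing by zero.
        return self.failed_runs / self.total_runs if self.total_runs else 0.0


def should_roll_back(
    baseline: HealthSample,
    canary: HealthSample,
    max_increase: float = 0.05,  # assumed tolerance, not a real policy
    min_runs: int = 20,          # assumed minimum sample size
) -> bool:
    """Return True when the canary looks meaningfully worse than baseline."""
    if canary.total_runs < min_runs:
        # Not enough runs to judge yet; keep observing.
        return False
    return canary.failure_rate - baseline.failure_rate > max_increase


if __name__ == "__main__":
    baseline = HealthSample(total_runs=1000, failed_runs=10)
    canary = HealthSample(total_runs=50, failed_runs=20)
    print(should_roll_back(baseline, canary))
```

A check like this only helps if the canary actually sees enough load, which is why the sketch refuses to decide below `min_runs`.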

I have filled out the RCA doc on Confluence and can publish it, but I'm hesitant to do so because I'm concerned about reminding people that I caused the SEV.

I have 2 concrete questions:

  1. Should I publish the RCA? I can prob get away without doing it since it doesn't look like it's enforced. I guess if someone does decide to follow up and sees I haven't filled it out it could be a bad look, but given that it's been over a month since I was supposed to publish it, it doesn't seem likely.
  2. If I should publish, should I engage with my oncall mentor about implementing some of his feedback, i.e. making testing in staging required or adding canary rollbacks? It seems like a lot of work (for myself or others) for something that was a "me-error", and people might get annoyed if they now have to test everything in staging when they don't currently have to. It also takes me away from my current tickets (which my manager prob cares more about me completing).

Thank you for reading this!


Discussion

(2 comments)
  • Tech Lead @ Robinhood, Meta, Course Hero
    8 days ago

    If you've already written everything up and it's just a click of the "Publish" button away, I feel like you should just hit the button (but don't draw any more attention to it).

    If there's meaningful work on top of pressing a button, I would just drop it and move on. It seems like the negative impact of this SEV was pretty minimal.

    Zooming out, hiding the fact that you did something bad isn't a great motivation. The factors here are more about your general productivity and spending your time well. It seems like this could just be a giant distraction if sharing out the full RCA is a meaningful amount of work.

  • Mid-Level Data Engineer [OP]
    Taro Community
    10 days ago

    One thing I should clarify is that I did test locally, but only with 1 DAG, and the issue only appeared at scale when many DAGs were run. If I had simply built and deployed the code to staging, I would have seen staging become unavailable because multiple DAGs would have run my change. Because this was my first PR, I didn't really know staging was a thing, or perhaps I should say I was too set on getting my PR deployed to think about it. I just want to make clear that I take ownership of my mistake while adding a little context.