What are the consequences of breaking production for a struggling engineer?

Entry-Level Software Engineer at Taro Community • 9 months ago

This is a serious question involving a friend who met only 3 out of 5 expectations during his mid-year review. He was hired as an IC2 and is about to reach the one-year mark. His team allowed him to approve and merge his own code into production, which resulted in a crash of the application server and AWS environment. This constitutes his first mistake, and someone has contacted the VP in the organization. What potential consequences could he face?

Discussion

(7 comments)
  • 4
    Thoughtful Tarodactyl
    Taro Community
    9 months ago

    As other commenters have alluded to, this sounds like a process failure - no one engineer/person should be allowed to merge code without other reviewers, and ideally there is a validation process/test environment so that such errors are caught long before the buggy code hits production.

    If your friend made a best effort to test their code and followed the established code-deployment process at their company, perhaps this is an opportunity to improve the process. If not, hopefully this gets resolved without too much consequence, and they can "go slow to go fast" next time.

  • 4
    Founder of Expanded Skills • Former Head of Engineering
    9 months ago

    "Is it a telltale sign if none of his teammates are talking to him anymore, except for his manager, and they are no longer reviewing his code? This started after the server crashed the other day."

    Doesn't sound like a healthy work environment at all.

    Like others have mentioned, the right intervention is a system-level fix (i.e., repairing a broken process that lacks enough guardrails).

    It is normal to be very stressed and worried in his situation. However, the right course of action for him is to have a very candid conversation that covers:

    • Acknowledge what he could have done better and show the preventative measures put in place, so that it's unlikely to happen again
    • Transition from the prior point to needing some help at the system/process level to make sure it's fixed properly

    I wouldn't be opposed if he started looking for another job since the trust is completely fractured here and I personally wouldn't be comfortable working with colleagues that leave you to suffer just to avoid the blast radius.

  • 3
    Eng @ Taro
    9 months ago

    I've tried to break down some action items in the immediate term, short term, and long term:

    Can they roll back their code immediately so the wound doesn't bleed any further? It sounds like they are trying to do a forward fix, but it's taking a while. I would at least try to mitigate the current problem and bring the system to a stable state. They will need to do as much damage control as they can right now so the problem doesn't compound further.

    In the short term, I would make sure that new PRs have a code review from a different person before they can be merged in. The PR descriptions should have a template that includes what the author did to test their code. I would also try to schedule a retrospective to outline what went wrong and what process changes could have helped prevent the issue. This can be presented to leadership to try to help save your friend by diverting the attention from him onto the missing processes.
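
As a rough illustration of the two short-term guardrails above (review by a different person, and a PR description that says how the change was tested), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `PullRequest` class, the `## Test plan` heading, and the function names are hypothetical and not tied to any real code-hosting API. Most hosting platforms offer built-in settings for required reviews, which would be the first thing to reach for.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    # Hypothetical, simplified model of a pull request (not a real API object).
    author: str
    description: str
    approvers: list[str] = field(default_factory=list)


# Assumed heading that the PR template asks authors to fill in.
TEST_PLAN_HEADING = "## Test plan"


def has_independent_approval(pr: PullRequest) -> bool:
    """True if at least one approver is someone other than the author."""
    return any(approver != pr.author for approver in pr.approvers)


def has_test_plan(pr: PullRequest) -> bool:
    """True if there is non-empty text under the test plan heading."""
    _, found, after = pr.description.partition(TEST_PLAN_HEADING)
    if not found:
        return False
    # Only consider text up to the next "## " heading, if there is one.
    section = after.split("\n## ", 1)[0]
    return bool(section.strip())


def merge_blockers(pr: PullRequest) -> list[str]:
    """Return the reasons this PR should not be merged yet (empty if none)."""
    blockers = []
    if not has_independent_approval(pr):
        blockers.append("needs approval from someone other than the author")
    if not has_test_plan(pr):
        blockers.append("description is missing a filled-in test plan")
    return blockers
```

A self-approved PR, like the one in the original question, would come back with the first blocker even if the author filled in the test plan.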

    In the long term, I would make sure there is some staging environment with either automated tests or manual tests that can detect the most basic, and most impactful, outages.
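
To make the staging idea a bit more concrete, here is a minimal smoke-check sketch in Python using only the standard library. The staging URL and the list of critical paths are placeholders I made up, and the `fetch` parameter exists only so the check can be exercised without a network. Treat it as a starting point under those assumptions, not as a full test suite.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical staging host and critical paths; replace with real ones.
STAGING_BASE_URL = "https://staging.example.com"
CRITICAL_PATHS = ["/healthz", "/login", "/api/orders"]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar problems.
        return 0


def failed_smoke_checks(
    base_url: str,
    paths: list[str],
    fetch: Callable[[str], int] = http_status,
) -> list[str]:
    """Return one message per critical path that did not answer with a 2xx."""
    failures = []
    for path in paths:
        status = fetch(base_url + path)
        if not 200 <= status < 300:
            failures.append(f"{path} returned {status}")
    return failures


if __name__ == "__main__":
    problems = failed_smoke_checks(STAGING_BASE_URL, CRITICAL_PATHS)
    for problem in problems:
        print("SMOKE CHECK FAILED:", problem)
    # A non-zero exit code lets a deploy pipeline stop before production.
    raise SystemExit(1 if problems else 0)
```

Run against staging after each build, a check like this would catch the "application server does not come up at all" class of outage before it reaches production.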

  • 2
    Eng @ Taro
    9 months ago

    Not all mistakes are equal, so it depends on the severity of the crash that he caused: was it due to negligence, how long was the outage, and were they actively involved in the process to fix the mistake?

    "His team allowed him to approve and merge his own code into production"

    By approve and merge his own code, does this mean that no one reviewed his code before he could merge it?

    If the engineer made a mistake because they didn't even run through the most basic flow to check that their code works, I would have a discussion with them about this.

    It also sounds like there are really tight guardrails around software engineers if they only just gained the privilege to merge code despite having been at the company for almost a year. Engineers need the freedom to make mistakes so they can grow.

    I would look at whether there could be any process changes that could have prevented this issue. Is there a staging environment that runs tests on the new build that could have caught this issue before it was deployed to production?

  • 1
    Tech Lead @ Robinhood, Meta, Course Hero
    9 months ago

    It really depends on how healthy (or unhealthy) the engineering culture of the organization is. I have seen both ends of the spectrum:

    • Worst case: Literally fired
    • Best case: Company embraces blameless culture and treats it as a learning and growth opportunity

    My advice for your friend is the following:

    • Have an honest conversation with their manager - They need to figure out if this incident dings them, and if so, by how much. Follow the advice here: [Masterclass] How To Work Better With Your Engineering Manager
    • Publicly learn from the experience - Their #1 priority should be to show the team that they're 100% serious about learning from this mistake, becoming a better engineer, and striving to never make a similar mistake ever again. The best companies have blameless culture and post-mortems, and every engineer can help foster that culture by showing a growth mentality and being humble with their mistakes. Here's a good thread about all that: "How to turn a string of silly mistakes into a mature positive outcome?"
  • 1
    Entry-Level Software Engineer [OP]
    Taro Community
    9 months ago

    I was shocked when he told me there was no code review for it. That can’t be right. He told me he was rushing to get it into production. It was mostly negligence. He spent all day fixing it, and the fix hasn’t been deployed yet. Last time we spoke, his manager was fighting for him. There was no immediate deadline, so I guess he was just being very arrogant/egotistical.

  • 0
    Entry-Level Software Engineer [OP]
    Taro Community
    9 months ago

    Is it a telltale sign if none of his teammates are talking to him anymore, except for his manager, and they are no longer reviewing his code? This started after the server crashed the other day.