I introduced a bug and I feel horrible.

Anonymous User at Taro Community
3 years ago

A little context: I joined my company back in Jan and have been doing really well. I’ve been consistently pushing code, owning projects, and creating good relationships with my colleagues.

This week I’ve been really sick and it feels like nothing is going right. I have been running an experiment for a couple of months, and today I found out that one of the arms is bugged because of a change I introduced a while ago. After a scrambled meeting with my lead and manager, I addressed the bug and pushed a fix, but because of code freezes my fix won’t actually be in until after year end. My manager said they wouldn’t have expected an engineer to let a bug like this go unnoticed for so long.

The business side is actually not even interested in the bugged arm and would like to go with an approach I proposed earlier in the project.

I’ve learned a lot since I pushed the bugged change, and my code and communication have improved, but my manager’s comment plus all the news about layoffs, perf reviews, and PIPs has had my anxiety through the roof the past few weeks. Anyone else feeling the same?

213 Views

4 Comments

Discussion

(4 comments)

3
Alex Chiou
•Robinhood, Meta, Course Hero, PayPal
3 years ago
Sorry to hear about your recent rough patch - It happens to all of us! That being said, here are my thoughts:

You shouldn't be working while sick - Software engineers can add a ton of business value quickly, but they can also destroy a ton of business value quickly. This is why it's important for SWEs to avoid pushing code when they're sick, especially if they're really sick like yourself. This is the classic scenario where someone puts in a bunch of hours but their impact doesn't match, because things like bad bugs generate negative business value.

In general, take breaks - If you haven't taken a proper break in a while, do so, for all the reasons I talked about in the prior point. The best SWEs understand their limits, and if you were a high-performer before this, it's simply a matter of bringing back that version of yourself. If you're going to take several days off, make sure to plan it out well and coordinate with your manager by following the advice here: "How much PTO should I take?"

Turn your mistake into growth - Software is hard, which is why even the best of us will inevitably mess up (and mess up big). What separates the best engineers here is that they don't let mistakes get to them; instead, they use them as inspiration to get better. Actions speak louder than words: Show your manager and team that you're going to take code quality and system stability even more seriously in the future by wrapping your future work with more safeguards and proactive thinking. Make it so that your manager will never be surprised by low execution quality ever again - Having that mentality is what will protect you the most in this layoff climate. I talk about recovering from mistakes more in-depth in this discussion: "How to turn a string of silly mistakes into a mature positive outcome?"

For a lot of tactical advice around upholding system quality and just writing good code in general, I highly recommend these:

[Course] Level Up Your Code Quality As A Software Engineer

[Course] System Design Masterclass: Shipping Real Features To Production
2
Senior Software Engineer
•Grab
3 years ago
I suggest reading through this resource and the answers in the thread:

https://techcareergrowth.slack.com/archives/C01M20V3MEJ/p1661834266692019

https://techcareergrowth.slack.com/archives/C01M20V3MEJ/p1661862937399349?thread_ts=1661834266.692019&cid=C01M20V3MEJ
2
Alex Chiou
•Robinhood, Meta, Course Hero, PayPal
3 years ago
To make things easier, I'll inline the thread from Tech Career Growth Slack. Keep in mind that I did not write this and the Grab Senior Engineer deserves all the credit (this will be much better once we have public identities in Taro Q&A, haha).

Sharing this in the channel, as I feel this is quite relevant to all of us:

The first thing to realise in case of any production incident is that "you're not alone". Unless you are the sole person with the know-how, handling an entire service or application by yourself (as in early-stage startups), there will always be peers, leads, and engineering managers you can loop in and get help from. This is especially crucial because folks who face production incidents for the first time often feel they will be targeted and that it will leave a black mark on their careers in the organisation, especially in times of remote work. This, however, could not be further from the truth.

"Ownership" and "Responsibility" are two of the key traits of a good software engineer, and they are especially important for your career growth. If you realise that an issue is linked to you and you can help with it, I would suggest you do. You can share your thoughts on how an item you worked on may be linked to the issue at hand. If you're able to provide analysis on the impact, or can simply tell the people handling the issue that you worked on a related aspect and can assist them, that would be a great thing to do. (It shows that you're proactive and can go above and beyond your currently assigned items.)

The most important thing in case of a software incident is to put the fire out. This may be a quick fix, or it may require planning both a short-term and a long-term fix.

The next step is RCA preparation. This involves eliciting the following:

How the issue was identified: Was it via automated monitoring or reported by clients/users? The former is preferable.

What were the series of events after it was identified?

How long did it take for each of the above events to occur? Could the turnaround time be shorter? How can it be made shorter?

The 5 whys analysis: Why did the issue happen? Why did the answer to that happen? And so on. This helps you reach the core issues concerning the service as a whole; fixing those has a high yield, as it prevents future issues as well.

Lessons learnt

Did the lack of unit tests cause the issue?

Is there even a culture of writing UTs within the organisation?

Do you have regression tests that validate none of the production flows break?

Was the code reviewed before it was sent out?

How do we ensure this issue does not reoccur going ahead?

Once we have the RCA, I would suggest setting up a postmortem discussion on the issue. This could be a recurring call. Fixing issues in silos does little to help the service, as the learnings are not evangelised.

Share it with the team. If possible, beyond your team as well. Fostering a culture of learning starts with all of us. And taking the first step in that direction can lead to a much better state.

With the ever-growing size of services, it is rarely a single factor or person that causes an issue, but rather a combination of factors and weak processes. Fixing these at their core helps the service.

Coming back to one of the points in the original post, on whether an apology should be put out: Personally, I feel that as software engineers who grow into software product owners within an organisation, we need to try to better the processes, which can act as multiple quality gates that leave little to no room for issues. An apology does little in that regard. Coming up with ways to contribute to the issue at hand and improving the service as a whole is much, much more impactful.

Sharing two very relevant documents on blameless incident handling: https://www.atlassian.com/incident-management/postmortem/blameless, https://sre.google/sre-book/postmortem-culture/

Again, nothing in the block above was written by me; it was all written by the Grab Senior Engineer.
0
Architect
•Self
3 years ago
Hello - I see links to the techcareergrowth channel on Slack. How do I get access?

Thanks