A little context: I joined my company back in Jan and have been doing really well. I’ve been consistently pushing code, owning projects, and creating good relationships with my colleagues.
This week I’ve been really sick and it feels like nothing is going right. I have been running an experiment for a couple of months, and today I found out that one of the arms has a bug that I introduced a while ago. After a scrambled meeting with my lead and manager, I fixed the bug and pushed the fix, but because of freezes it won’t actually be in until after year end. My manager stated that they wouldn’t have expected an engineer to let a bug like this go unnoticed for so long.
The business side is actually not even interested in the bugged arm and would like to go with an approach I proposed earlier in the project.
I’ve learned a lot since I pushed the bugged change, and my code and communication have improved, but my manager’s comment plus all the news about layoffs, perfs, and PIPs has had my anxiety through the roof the past few weeks. Anyone else feeling the same?
Sorry to hear about your recent rough patch - It happens to all of us! That being said, here are my thoughts:
For a lot of tactical advice around upholding system quality, I highly recommend my system design series as well: System Design Masterclass: Taro Playlists
I suggest reading through this resource and the answers in the thread:
Hello - I see links to the techcareergrowth channel on Slack. How do I get access?
To make things easier, I'll inline the thread from Tech Career Growth Slack. Keep in mind that I did not write this and the Grab Senior Engineer deserves all the credit (this will be much better once we have public identities in Taro Q&A, haha).
Sharing this in the channel, as I feel this is quite relevant to all of us:
The first thing to realise in case of any production incident is that "you're not alone". Unless you are the sole person with the know-how, handling an entire service or application by yourself (as in early-stage startups), there will always be peers, leads, and engineering managers that you can loop in and get help from. This is especially crucial because folks who face production incidents for the first time end up feeling they will be targeted and that it will leave a black mark on their careers in the organisation, especially in times of remote work. This, however, could not be further from the truth.
"Ownership" and "Responsibility" are key aspects of a good software engineer, and especially important for your career growth. If you realise that there is some issue that is linked to you, and you can support it, I would suggest you do. You can share your thoughts on how a certain item you worked on may be linked to the issue at hand. If you're able to provide analysis on the impact, or can just mention to the people handling the issue that you worked on a related aspect and can assist them, that is a great thing to do (it proves that you're proactive and can go above and beyond your currently assigned items).
The most important thing in case of a software incident is to put the fire out. This may involve a quick fix, or it may involve planning for both a short-term and a long-term fix.
The next step is RCA preparation. This involves eliciting the following:
Once we have the RCA, I would suggest setting up a postmortem discussion on the issue. This could be a recurring call. Fixing issues in silos does little to help the service, as the learnings are not evangelised.
Share it with the team. If possible, beyond your team as well. Fostering a culture of learning starts with all of us. And taking the first step in that direction can lead to a much better state.
With the ever-growing size of services, it is rarely a single factor or person that causes the issue, but rather a bunch of factors and weak processes combined. Fixing them at their core helps the services.
Coming back to one of the points in the original post, on whether an apology should be put out: personally, I feel that as software engineers who grow up to be software product owners within an organisation, we need to try to better the processes, which can act as multiple quality gates that leave little to no room for issues. An apology does little in that regard. Coming up with ways to contribute to the issue at hand, and improving the service as a whole, is much more impactful.
Sharing some very relevant documents on blameless incident handling: https://www.atlassian.com/incident-management/postmortem/blameless, https://sre.google/sre-book/postmortem-culture/
Again, nothing in the block above was written by me; it was all written by the Grab Senior Engineer.