I'm on-call this week, and once again I have a strong sense of not having a clue what I'm doing. I know most of the job of engineering is working within a legacy system, but I feel like I'm missing the tactics to make any progress at all. On top of that, switching contexts to address alarms and queries meant I made no progress whatsoever today. So this is part rant and part question: any tips on confronting a wall of overwhelm and making progress on the most impactful bugs?
One thing I'm gonna ask: do you have a sense of which systems are more important than others? Do you know what should be a page? Do you know what can be left aside? These are the questions to ask yourself when you're dealing with pages and don't have enough bandwidth. I'm taking over a job right now where I literally got 40 pages just this past weekend, and the team we're taking over from would just work all weekend. I'm not accepting that. I'm helping the team drive down man-hours in two ways: finding things we can address ourselves and doing root-cause analysis on them, and finding deficiencies in the product itself and working with the other teams whose products we use to get them fixed. All of this matters for reducing working hours and making sure partner teams are operating efficiently.
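If you can export your page history, a quick tally by alert is one way to decide where root-cause analysis will pay off first. This is only a minimal sketch: the `pages` list, its `alert` and `service` fields, and the `noisiest_alerts` helper are all hypothetical, so adapt them to whatever your paging tool actually gives you.

```python
from collections import Counter


def noisiest_alerts(pages, top_n=3):
    """Return the top_n (alert_name, count) pairs, most frequent first."""
    counts = Counter(page["alert"] for page in pages)
    return counts.most_common(top_n)


# Hypothetical page log; real data would come from your paging tool's export.
pages = [
    {"alert": "disk-full", "service": "ingest"},
    {"alert": "high-latency", "service": "api"},
    {"alert": "disk-full", "service": "ingest"},
    {"alert": "disk-full", "service": "ingest"},
    {"alert": "high-latency", "service": "api"},
    {"alert": "cert-expiring", "service": "gateway"},
]

for alert, count in noisiest_alerts(pages):
    print(f"{alert}: {count} pages")
```

If a couple of alerts account for most of the pages, those are the ones worth root-causing or re-tuning before anything else.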
This is not a direct answer to your question, but is there a ritual in your team/company around documenting what the on-call experience was like for each person?
If not, I highly recommend starting a "hand-off" system which includes a few bullet points about the week. There are 2 huge benefits:
You have two obligations while you are oncall:
If you're working on a bug but continually get pulled off it, is it an operational bug, i.e. the source of the alarms that are going off? Or is it an impactful bug that doesn't get in the way of operations? Maybe it's a CX defect that doesn't generate tickets. If it's the former, maybe your team is getting buried in ops, and it might make sense to do an ops sprint to clear things out. If it's happening to you, it's likely happening to your team members too. If it's the latter, STOP. You should only be doing 1 and 2 above. Non-urgent bug fixes should not be worked on by the oncall.
Oncall is a super chaotic and scary experience when you're just starting out, so I totally empathize with this. Here's my advice for tackling oncall at a high level:
We actually gave an entire masterclass on debugging, so I highly recommend that too: [Masterclass] How To Become A Debugging Master And Fix Issues Faster
I also gave a senior -> staff case study on how I rebuilt the oncall in my Instagram org from the ground up. This is less directly applicable, but I do believe understanding the rhyme and reason behind what makes a good oncall helps you be more effective at oncall: [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project