It's oncall again and I'm unsure how to be the most productive - Any tips?

Anonymous User at Taro Community
2 years ago

I'm on-call this week and I'm met once again with a strong sense of not having a clue what I'm doing. I know the majority of the job of engineering is trying to work within a legacy system, but I feel like I'm missing tactics to help me make any progress at all. That, plus having to switch contexts to address alarms and queries, has meant that I made no progress today whatsoever. So this is part rant and part question: any tips on confronting a wall of overwhelm and making progress towards the most impactful bugs?

917 Views

4 Comments

Discussion

(4 comments)

10
Steve Huynh
•Principal Software Engineer at Amazon
2 years ago
You have two obligations while you are oncall:

1. Field tickets.

2. Effect change so the system that you own runs better (and therefore generates fewer tickets).

If you're working on a bug but continually get pulled off, is it an operational bug, i.e. the source of the alarms that are going off? Or is it an impactful bug that doesn't get in the way of operations? Maybe it's a CX defect that doesn't generate tickets. If it's the former, maybe your team is getting buried in ops, and it might make sense to do an ops sprint to clear things out. If it's happening to you, it's likely happening to your team members. If it's the latter, STOP. You should only be doing the two things listed above: fielding tickets and making the system run better. Non-urgent bug fixes should not be worked on by the oncall.
4
Rahul Pandey
•Tech Lead/Manager at Meta, Pinterest, Kosei
2 years ago
This is not a direct answer to your question, but is there a ritual in your team/company around documenting what the on-call experience was like for each person?

If not, I highly recommend starting a "hand-off" system which includes a few bullet points about the week. There are 2 huge benefits:

You start to see that you're not alone in feeling lost or scatter-brained. When other people do their retrospective, you can ask them (or read their notes) how they did various tasks and where they spent their time.

The retrospectives shine a light on potential problems, and then you can decide as a team whether (1) to do additional training to handle common scenarios or (2) build tooling to make life easier for the on-call.
3
Alex Chiou
•Tech Lead @ Robinhood, Meta, Course Hero
2 years ago
Oncall is a super chaotic and scary experience when you're just starting out, so I totally empathize with this. Here's my advice to tackle oncall at a high level:

Don't be afraid to ask lots of questions - Even though 1 person is on point to be oncall, the entire team is responsible for supporting that person. Protecting the integrity of your system is a team-wide responsibility. Use this video as a guide: Asking Effective Questions That Get Great Answers Quickly

Compartmentalize as much as you can - When you're oncall, find the highest priority issue and solely focus on that, pretending the other issues on your plate don't exist. After you fix that #1 issue, move on to the next one and repeat this process until your plate is clear. This focus is especially important for more junior engineers and those new to oncall. It's effectively a more dynamic version of focus blocks (you need to check if incoming issues are more important than your current one): A Powerful Tool For Software Engineer Productivity - Focus Blocks

Write everything down - You will almost always get multiple issues while you're oncall, so you need to be able to switch between them when a new bug comes in or a backlogged bug has a new issue that boosts its priority. By diligently maintaining a paper trail, it's way easier to context switch between bugs as necessary, since you can easily pick up where you left off. We talk about this in-depth here: "Tips for someone with poor working memory?"
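A minimal sketch of how the "compartmentalize" and "write everything down" tips could fit together, assuming issues carry a numeric priority where a lower number means more urgent. The names `Issue` and `OncallQueue` are hypothetical and purely illustrative, not any real oncall tool:

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Issue:
    priority: int                                  # assumed convention: lower = more urgent
    title: str = field(compare=False)
    notes: list = field(default_factory=list, compare=False)  # the paper trail


class OncallQueue:
    """Work on exactly one issue; everything else waits in a priority heap."""

    def __init__(self):
        self._backlog = []      # min-heap ordered by Issue.priority
        self.current = None     # the single issue being worked on

    def report(self, issue):
        """Add an incoming issue, switching only if it outranks the current one."""
        if self.current is None:
            self.current = issue
        elif issue.priority < self.current.priority:
            # Leave a note before switching so the work can be resumed later.
            self.current.notes.append(f"paused for: {issue.title}")
            heapq.heappush(self._backlog, self.current)
            self.current = issue
        else:
            heapq.heappush(self._backlog, issue)

    def resolve_current(self):
        """Close the current issue and pick up the next most urgent one."""
        done = self.current
        self.current = heapq.heappop(self._backlog) if self._backlog else None
        return done
```

The point is less the data structure than the habit: an incoming issue only interrupts you if it is more urgent than what you are doing, and every switch leaves a note behind.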

We actually gave an entire masterclass on debugging, so I highly recommend that too: [Masterclass] How To Become A Debugging Master And Fix Issues Faster

I also gave a senior -> staff case study on how I rebuilt the oncall in my Instagram org from the ground up. This is less directly applicable, but I do believe understanding the rhyme and reason behind what makes a good oncall helps you be more effective at oncall: [Case Study] Revamping Oncall For 20 Instagram Engineers - Senior to Staff Project
2
Brad Messer
•Senior Software Engineer at IBM
2 years ago
One thing I'm gonna ask is: do you have a sense of which systems are more important than others? Do you know what should be a page? Do you know what can be left aside? These are questions you should be asking yourself when you're dealing with pages and don't have enough bandwidth. I'm taking over a job right now where I literally got 40 pages just this past weekend, and the team we're taking over from would just work all weekend. I'm not accepting of that. I'm helping the team drive down man-hours, partly by finding things we can address and doing root-cause analysis, and partly by finding deficiencies in the product itself and working with the other teams whose products we use to reduce the load on our side. This is all important for reducing working hours and ensuring partner teams are acting at top efficiency.