It's oncall again and I'm unsure how to be the most productive - Any tips?

Anonymous User at Taro Community
2 months ago

I'm on-call this week and, once again, I have a strong sense of not having a clue what I'm doing. I know most of the job of engineering is working within a legacy system, but I feel like I'm missing the tactics to make any progress at all. On top of that, switching contexts to address alarms and queries meant I made no progress today whatsoever. So this is part rant and part question: any tips on confronting a wall of overwhelm and making progress towards the most impactful bugs?

5 Likes
137 Views
3 Comments

Discussion

(3 comments)
  • Brad Messer
    Senior Software Engineer at IBM
    2 months ago

    One thing I'm gonna ask: do you have a sense of which systems are more important than others? Do you know what should be a page? Do you know what can be left aside? These are the questions to ask yourself when you're dealing with pages and don't have enough bandwidth. I'm taking over a job right now where I literally got 40 pages just this past weekend, and the team we're taking over from would just work all weekend. I don't accept that, and I'm helping the team drive down man-hours in two ways: finding things we can address and doing root-cause analysis on them, and finding deficiencies in the product itself and working with the other teams whose products we use to get them fixed. All of this matters for reducing working hours and ensuring partner teams are acting at top efficiency.

    1 Like
  • Rahul Pandey
    Tech Lead/Manager at Meta, Pinterest, Kosei
    2 months ago

    This is not a direct answer to your question, but is there a ritual in your team/company around documenting what the on-call experience was like for each person?

    If not, I highly recommend starting a "hand-off" system which includes a few bullet points about the week. There are 2 huge benefits:

    1. You start to see that you're not alone in feeling lost or scatter-brained. When other people do their retrospective, you can ask them (or read their notes) how they did various tasks and where they spent their time.
    2. The retrospectives shine a light on potential problems, and then you can decide as a team whether (1) to do additional training to handle common scenarios or (2) build tooling to make life easier for the on-call.

    2 Likes
  • Steve Huynh
    Principal Software Engineer at Amazon
    2 months ago

    You have two obligations while you are oncall:
    1. Field tickets.
    2. Effect change so the system that you own runs better (and therefore generates fewer tickets).

    If you're working on a bug but continually get pulled off, is it an operational bug, i.e. the source of the alarms that are going off? Or is it an impactful bug that doesn't get in the way of operations? Maybe it's a CX defect that doesn't generate tickets. If it's the former, maybe your team is getting buried in ops, and it might make sense to do an ops sprint to clear things out. If it's happening to you, it's likely happening to your team members. If it's the latter, STOP. You should only be doing 1 and 2 above. Non-urgent bug fixes should not be worked on by the oncall.

    7 Likes