How to form a healthy on-call rotation?

Anonymous User at Taro Community, 3 months ago

My manager and I have been owning the on-call rotation for the backend/platform of my company's flagship product, which we launched recently. The rotation of 2-3 engineers is hectic and overwhelming. My manager and I raised this issue and finally got acknowledgment from the rest of the organization that more engineers need to be added to form a healthy on-call rotation. Is an 8-10 engineer on-call rotation a healthy rotation?

10 Likes
428 Views
6 Comments

Discussion

(6 comments)
  • Rahul Pandey
    Meta, Pinterest, Kosei
    3 months ago

    2-3 engineers in an oncall is unhealthy IMO. The rule should be that the knowledge/context on the team should still survive if 2 people are not available (e.g. one person on vacation, and one person who quits). So having 2 people total in one rotation is definitely not good.

    Of course, this depends on the nature of the oncall. Do you have a log of what issues the oncall deals with, and how much time it requires? If it's a low-stakes oncall, e.g. just updating documentation, it may be fine. But for something that directly impacts production, 8-10 people is much better.

    Also see: https://www.jointaro.com/question/bzMluuZEOPJPKz9gzLj9/does-being-on-call-during-early-career-teach-you-things-or-allow-you-to-grow-technically/

    8 Likes
  • Anonymous User [OP]
    Taro Community
    3 months ago

    I have compiled a support runbook that logs each corresponding issue/alert, so that the on-call team understands the severity of the business impact. Issues can range from low priority to business critical.

    However, the support runbook is not complete enough to be the ultimate source of truth, since production support behaves more like triage than debugging.

    As a result, the nature of the on-call work can range from acknowledging alerts and following the runbook steps to working with business owners. There have also been a few times when the issue was caused by another team or a 3rd-party vendor and could not be solved by the on-call person.

    I am interested to learn more about how other people view a healthy on-call rotation.

    It seems like there are several factors to form a healthy on-call.

    • Team size and number of engineers available
    • Team maturity
    • Well-maintained documentation

    I see that there are a few online posts here:

    4 Likes
  • Roger Hu
    Engineering Manager
    3 months ago

    Yes, 2-3 people is going to be overwhelming. I think you need at least 4 to make it sane (once a month) with a manager serving as a backstop. Will Larson endorses the notion of 8 people (https://lethain.com/sizing-engineering-teams/). I would also take a hard look at the seniority of your team to make sure you have engineers who are independent enough to deal with ambiguous questions and debug issues that cut across multiple domains/systems. Runbooks and documentation, not to mention shadowing, are important here too: they make sure there is sufficient training and expertise so that issues don't escalate to other engineers who are trying to get project work done.
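
A rough sketch of the arithmetic behind these rotation sizes. It assumes one engineer on call at a time and one-week shifts, neither of which is stated in the thread, and the function names are hypothetical:

```python
def weeks_between_shifts(rotation_size: int, shift_length_weeks: int = 1) -> int:
    """Weeks from the start of one of an engineer's shifts to the start of their next.

    Assumes a simple round-robin with one engineer on call per shift.
    """
    if rotation_size < 1 or shift_length_weeks < 1:
        raise ValueError("rotation_size and shift_length_weeks must be >= 1")
    return rotation_size * shift_length_weeks


def fraction_of_time_on_call(rotation_size: int) -> float:
    """Share of calendar time each engineer spends on call in a round-robin."""
    if rotation_size < 1:
        raise ValueError("rotation_size must be >= 1")
    return 1 / rotation_size


# Rotation sizes mentioned in the thread: 2-3 (current), 4, 8, 10.
for size in (2, 3, 4, 8, 10):
    interval = weeks_between_shifts(size)
    share = fraction_of_time_on_call(size)
    print(f"{size:>2} engineers: on call every {interval} weeks ({share:.0%} of the time)")
```

Under those assumptions, 4 engineers works out to a shift every 4 weeks (roughly once a month), and 8 engineers to one every 8 weeks.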

    7 Likes
  • Alex Chiou
    Tech Lead @ Robinhood, Meta, Course Hero
    a month ago

    "The rotation of 2-3 engineers is hectic and overwhelming"

    2-3 engineers is way too few for a full, healthy oncall rotation. At this point, you should just go the "traditional" route: when an issue comes in, it goes to whoever owns the code behind it.

    I am surprised that a rotation of 2-3 engineers is hectic though: The amount of surface area owned across just 2-3 people should be low. From this, 1 of 2 things is probably true:

    1. Despite not having a lot of code, there's a ton of technical debt and fragile system design already in what you own
    2. There are a lot of engineers who are connected to the surface area of the oncall's issues, but they aren't on the oncall (so they're writing code that breaks which they don't need to own)

    "Is an 8-10 engineer on-call rotation a healthy rotation?"

    Generally yes. I think 7-12 is the sweet spot for oncall size from my experience.

    Here's some other resources on oncall, which may be helpful:

    2 Likes
  • Anonymous User [OP]
    Taro Community
    a month ago

    Over the past three months, my manager (Director) has repeatedly told me that we should have at least six full-time equivalent (FTE) engineers to handle on-call support. Unfortunately, he has been unable to deliver on this promise due to push-back from other teams. Currently, my director is still part of the on-call rotation, and he told me that he would like to be relieved from on-call duties. I am considering what other actions I should take to establish a more sustainable and healthy on-call rotation.

    My manager told me that his ideal goal is for each engineer to be on call only once every 3 months, but the math currently does not add up. Any thoughts?
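
One way to see why the numbers may not add up is to work backwards from the target. This is a minimal sketch, assuming one-week shifts, one engineer per shift, and 13 weeks per quarter; none of these are stated in the thread, and `engineers_needed` is a hypothetical helper:

```python
import math

WEEKS_PER_QUARTER = 13  # assumption: 52 weeks / 4 quarters


def engineers_needed(
    target_interval_weeks: int,
    shift_length_weeks: int = 1,
    engineers_per_shift: int = 1,
) -> int:
    """Smallest round-robin size so nobody is on call more often than the target.

    Each engineer's shift recurs every (shifts in the cycle * shift length) weeks,
    so the cycle needs at least ceil(target / shift length) shifts.
    """
    if min(target_interval_weeks, shift_length_weeks, engineers_per_shift) < 1:
        raise ValueError("all arguments must be >= 1")
    shifts_in_cycle = math.ceil(target_interval_weeks / shift_length_weeks)
    return shifts_in_cycle * engineers_per_shift


needed = engineers_needed(WEEKS_PER_QUARTER)
print(f"Engineers needed for one shift per quarter: {needed}")

# The six FTEs mentioned above, with one-week shifts, give a shift every 6 weeks.
print(f"Interval with 6 engineers: every {6 * 1} weeks")
```

Under these assumptions, a once-per-quarter cadence needs about 13 engineers, while six engineers gives each person a shift every 6 weeks. Longer shifts or a primary/secondary pairing would change the count.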

    1 Like
  • Alex Chiou
    Tech Lead @ Robinhood, Meta, Course Hero
    a month ago

    "I am considering what other actions I should take to establish a more sustainable and healthy on-call rotation."

    If you aren't able to get the bodies, I just don't think you should have a formal oncall rotation. The ideal scenario here is to have a TPM who's decently good at routing fires and bugs to the proper owner.

    In the meantime, focus on improving the system. Oncalls aren't inherently hectic - They become overwhelming when the system quality is poor. If you're getting a lot of issues, I would try to figure out the common root causes and fix them. Maybe even dedicate an entire sprint towards "Better Engineering" to make the system more reliable and break less.
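
As a starting point for finding common root causes, an incident log can be tallied by cause. The sketch below is only illustrative: the log entries are made-up placeholders, not data from this thread, and `top_root_causes` is a hypothetical helper.

```python
from collections import Counter

# Placeholder incident log as (alert name, root cause) pairs.
# These values are invented for illustration only.
incident_log = [
    ("alert-1", "cause-a"),
    ("alert-2", "cause-b"),
    ("alert-3", "cause-a"),
    ("alert-4", "cause-c"),
    ("alert-5", "cause-a"),
    ("alert-6", "cause-b"),
]


def top_root_causes(incidents, n=3):
    """Return the n most frequent root causes as (cause, count) pairs."""
    counts = Counter(cause for _alert, cause in incidents)
    return counts.most_common(n)


for cause, count in top_root_causes(incident_log):
    print(f"{cause}: {count} incidents")
```

The causes at the top of the list are natural candidates for a reliability-focused sprint.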

    2 Likes