I work for a company that offers online web and mobile apps for US-based customers. As part of a recent re-organization, all mobile, web, and backend engineers have been combined into a single on-call rotation. Most of these 20+ engineers (mobile and web) have little context on the backend system. To alleviate the frequent on-call shifts, my director proposes a healthily sized rotation using the "follow the sun" model: training engineers in different time zones and transferring knowledge about the backend system and its common issues. I'm curious how I can effectively onboard and train over 20 web and mobile engineers for the on-call rotation under this model.
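For context, "follow the sun" just means the pager is handed between regional teams so each one covers its own business hours. A minimal sketch of the idea (the region names and handoff hours below are made-up assumptions, not our actual setup):

```python
from datetime import datetime, timezone

# Hypothetical regional teams and the UTC hours each covers.
# The boundaries are illustrative; real handoff times would be negotiated.
REGIONS = [
    ("APAC",  0, 8),    # 00:00-07:59 UTC
    ("EMEA",  8, 16),   # 08:00-15:59 UTC
    ("AMER", 16, 24),   # 16:00-23:59 UTC
]

def on_call_region(now_utc: datetime) -> str:
    """Return which regional team holds the pager at a given UTC time."""
    hour = now_utc.hour
    for name, start, end in REGIONS:
        if start <= hour < end:
            return name
    raise ValueError("hour out of range")  # unreachable: regions cover 0-24

print(on_call_region(datetime(2024, 1, 15, 3, tzinfo=timezone.utc)))   # APAC
print(on_call_region(datetime(2024, 1, 15, 12, tzinfo=timezone.utc)))  # EMEA
print(on_call_region(datetime(2024, 1, 15, 20, tzinfo=timezone.utc)))  # AMER
```

The point of the model is that no team is paged outside its local daytime, which is why it pairs well with regional knowledge transfer.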
The backend team has compiled a comprehensive support run-book with an entry for each issue/alert, showing its severity, priority, and scope. The on-call rotation involves acknowledging alerts and following the steps outlined in the run-book.
Please note that the support run-book is not a 100% comprehensive source of truth: the production system is integrated with multiple third-party APIs and systems, and the backend platform serves as middleware for both the mobile and web applications. Some issues are caused by third-party vendors and cannot be solved by the on-call person.
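For reference, a run-book entry along those lines might look like the following. This is a hypothetical template; the alert name, fields, and values are illustrative, not our actual format:

```yaml
# Hypothetical run-book entry (illustrative only)
alert: payments-api-5xx-rate-high      # made-up alert name
severity: SEV-2
priority: P1
scope: backend middleware -> payment vendor API
steps:
  - Acknowledge the page in the paging tool.
  - Check the payments dashboard for error-rate trends.
  - If errors originate from the vendor, check the vendor's status page.
escalation:
  third_party: Open a ticket with the vendor and note it in the incident channel.
  internal: Page the backend on-call secondary after 15 minutes with no progress.
```

Having an explicit escalation field per entry is one way to handle the third-party cases the on-call person can't resolve themselves.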
I would love to hear your thoughts and perspectives on this matter. I'm also meeting with my boss for our one-on-one to talk through his idea. This is still an experiment, but I'd like to get people's perspectives. Thank you!
Here's a response from ChatGPT:
I would advise the following steps for effectively onboarding and training 20+ engineers for production on-call support:
Create a comprehensive training plan: Develop a training plan that covers all aspects of the backend platform, including its architecture, systems, and processes, so engineers have everything they need to know about the platform in one place.
Provide hands-on training: Run hands-on sessions where engineers work directly on the platform and get familiar with its features, processes, and tools. This will help them understand the system and be prepared to handle incidents.
Use real-life examples: Illustrate the concepts you are teaching with real incidents. This will help the engineers better understand the platform and how to handle real-life scenarios.
Encourage collaboration: Encourage the engineers to ask questions and provide feedback. This builds a strong team and a supportive learning environment.
Ensure documentation is up-to-date: Keep the support run-book current, with all relevant information about the platform, its systems, and its processes, so engineers can resolve issues quickly.
Provide ongoing support: Continue supporting the engineers after training is complete, so they stay confident and prepared when issues do arise.
While I'm no expert, it sounds like your team is heading (broadly) in the right direction. Generally speaking, adding more engineers and shifting oncall rotations to geographical timezones (aka follow the sun) are the two most impactful things you can do to reduce oncall stress, outside of directly reducing the number of alerts.
Here are my thoughts:
Bias towards experience, not training
You'll want to bias towards getting engineers to a "hands on keyboard" stage as quickly as possible - i.e. actually being the primary contact for an oncall rotation. Training will only ever get you so far; real experience and ability come from having skin in the game. First oncall shifts will always be messy and have mistakes, so longer training periods only delay that.
Shadowing then implementing
It can often be difficult to properly explain/document how you found an issue or a bug, particularly when it might not be easily replicable. But you want new engineers to have this context so they understand what led you to certain decisions when they get a similar page. The solution is to have them shadow you - whenever you get a nontrivial alert, pull the new engineer into a call, screen share, and begin to resolve the issue. Bonus points if you instead have them screen share while you tell them the commands to run - they'll remember those commands much better than anything you just show them.
Pair new engineers up with experienced engineers
Ensure that the secondary engineer is one who's already experienced with and knowledgeable about your team's oncall practices, then encourage the new engineer to check in frequently with the more knowledgeable one whenever they're unsure about anything. That way you can be sure they're running the right commands and looking in the right places, and can correct or readjust their approach as needed.
Have new engineers write/rewrite the runbooks
Chances are at least some of your runbooks will be incorrect - they'll be out of date, or operate under certain assumptions about the reader, like assuming the reader knows where certain passwords are stored or what certain services are called. When you run into these problems with new engineers, provide them with the necessary knowledge (like the service name) and then have them update the runbook. This way the new engineer gains a deeper understanding of your architecture through the rewrite, and you open up the possibility of finding other problems with the runbook(s).
In short: think of this the same way we think about development cycles - we always want the shortest possible time between development and getting a feature into prod, to better find bugs and iterate on the idea. Likewise, it's better to get new oncall engineers into the rotation ASAP and then resolve the problems that arise.