I work for a company that offers online web and mobile apps for US-based customers. As part of a recent re-organization, all mobile, web, and backend engineers have been combined into a single on-call rotation. Most of these 20+ engineers (mobile and web) have little context on the backend system. To alleviate the frequent on-call shifts, my director proposes a healthily sized rotation using the "follow the sun" model: training engineers in different time zones and transferring knowledge about the backend system and its common issues. I'm curious how I can effectively onboard and train over 20 web and mobile engineers for the on-call rotation under this model.
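For context, "follow the sun" just means the pager is handed between regional teams so each one covers its own business hours. A minimal sketch of the idea (the region names and handoff hours below are made-up assumptions, not our actual setup):

```python
from datetime import datetime, timezone

# Hypothetical regional teams and the UTC hours each covers.
# The boundaries are illustrative; real handoff times would be negotiated.
REGIONS = [
    ("APAC",  0, 8),    # 00:00-07:59 UTC
    ("EMEA",  8, 16),   # 08:00-15:59 UTC
    ("AMER", 16, 24),   # 16:00-23:59 UTC
]

def on_call_region(now_utc: datetime) -> str:
    """Return which regional team holds the pager at a given UTC time."""
    hour = now_utc.hour
    for name, start, end in REGIONS:
        if start <= hour < end:
            return name
    raise ValueError("hour out of range")  # unreachable: regions cover 0-24

print(on_call_region(datetime(2024, 1, 15, 3, tzinfo=timezone.utc)))   # APAC
print(on_call_region(datetime(2024, 1, 15, 12, tzinfo=timezone.utc)))  # EMEA
print(on_call_region(datetime(2024, 1, 15, 20, tzinfo=timezone.utc)))  # AMER
```

The point of the model is that no team is paged outside its local daytime, which is why it pairs well with regional knowledge transfer.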
The backend team has compiled a comprehensive support run-book with an entry for each issue/alert, showing its severity, priority, and scope. The on-call rotation involves acknowledging alerts and following the steps outlined in the run-book.
Please note that the support run-book is not a 100% comprehensive source of truth: the production system is integrated with multiple third-party APIs and systems, and the backend platform serves as middleware for both the mobile and web applications. Some issues are caused by third-party vendors and cannot be solved by the on-call person.
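For reference, a run-book entry along those lines might look like the following. This is a hypothetical template; the alert name, fields, and values are illustrative, not our actual format:

```yaml
# Hypothetical run-book entry (illustrative only)
alert: payments-api-5xx-rate-high      # made-up alert name
severity: SEV-2
priority: P1
scope: backend middleware -> payment vendor API
steps:
  - Acknowledge the page in the paging tool.
  - Check the payments dashboard for error-rate trends.
  - If errors originate from the vendor, check the vendor's status page.
escalation:
  third_party: Open a ticket with the vendor and note it in the incident channel.
  internal: Page the backend on-call secondary after 15 minutes with no progress.
```

Having an explicit escalation field per entry is one way to handle the third-party cases the on-call person can't resolve themselves.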
I would love to hear your thoughts and perspectives on this matter. I'm also meeting with my boss for our one-on-one to talk through his idea. This is still an experiment, but I'd like to get people's perspectives. Thank you!
Here's a response from ChatGPT:
I would advise the following steps for effectively onboarding and training 20+ engineers for production on-call support:
Create a comprehensive training plan: Develop a training plan that covers all aspects of the backend platform, including its architecture, systems, and processes, so engineers have everything they need to know about the platform in one place.
Provide hands-on training: Run hands-on sessions where engineers work directly on the platform and get familiar with its features, processes, and tools. This will help them understand the system and be prepared to handle incidents.
Use real-life examples: Illustrate the concepts you are teaching with real incidents. This will help the engineers better understand the platform and how to handle real-life scenarios.
Encourage collaboration: Encourage the engineers to ask questions and provide feedback. This builds a strong team and a supportive learning environment.
Ensure documentation is up-to-date: Keep the support run-book current, with all relevant information about the platform, its systems, and its processes, so engineers can resolve issues quickly.
Provide ongoing support: Continue supporting the engineers after training is complete, so they stay confident and prepared when issues do arise.
While I'm no expert, it sounds like your team is heading (broadly) in the right direction. Generally speaking, adding more engineers and shifting oncall rotations to geographical timezones (aka follow the sun) are the two most impactful things you can do to reduce oncall stress, outside of directly reducing the number of alerts.
Here are my thoughts:
Bias towards experience, not training
You'll want to bias towards getting engineers to a "hands on keyboard" stage as quickly as possible - i.e. actually being the primary contact for an oncall rotation. Training will only ever get you so far; real experience and ability come from having skin in the game. First oncall shifts will always be messy and have mistakes, so longer training periods only delay that.
Shadowing then implementing
It can often be difficult to properly explain/document how you found an issue or a bug, particularly when it might not be easily replicable. But you want new engineers to have this context so they understand what led you to certain decisions when they get a similar page. The solution is to have them shadow you - whenever you get a nontrivial alert, pull the new engineer into a call, screen share, and begin to resolve the issue. Bonus points if you instead have them screen share while you tell them the commands to run - they'll remember those commands much better than anything you just show them.
Pair new engineers up with experienced engineers
Ensure that the secondary engineer is one who's already experienced with and knowledgeable about your team's oncall practices, then encourage the new engineer to check in frequently with the more knowledgeable one whenever they're unsure about anything. That way you can be sure they're running the right commands and looking in the right places, and can correct or readjust their approach as needed.
Have new engineers write/rewrite the runbooks
Chances are at least some of your runbooks will be incorrect - they'll be out of date, or operate under certain assumptions about the reader, like assuming the reader knows where certain passwords are stored or what certain services are called. When you run into these problems with new engineers, provide them with the necessary knowledge (like the service name) and then have them update the runbook. This way the new engineer gains a deeper understanding of your architecture through the rewrite, and you open up the possibility of finding other problems with the runbook(s).
In short: think of this the same way we think about development cycles - we always want the shortest possible time between development and getting a feature into prod, to better find bugs and iterate on the idea. Likewise, it's better to get new oncall engineers into the rotation ASAP and then resolve the problems that arise.