
How can I quickly take ownership of a complex product handed to my team?

Anonymous User at Taro Community · 2 years ago

Hi,

My team has been handed a product in a transfer of ownership. The product is critical to the company and is very large and complex. It's ML-based, with three main components:

  • Multi-stage data pipelines (Google Cloud)
  • ML models trained recurrently (Kubeflow, Google Cloud)
  • Model serving (custom gRPC services in Go, Google Cloud)

Each part feels like a sea of knowledge, and I'm wondering how I can get a holistic understanding of how everything works. There is also a lot of room for improvement. For example, the process for A/B testing new combinations of parameters for the ML models in production is very manual: you have to open 3 PRs in different repos and change some config file entries. I'm wondering what the best approach is to improve this, as a lot of data scientists depend on it.


Discussion (2 comments)
  • Meta, Pinterest, Kosei
    2 years ago

    The first thing I'd do is define more clearly who owns the knowledge around all the various parts of this massive ML product. IMO, it'd be unreasonable to expect a single person to quickly understand and debug all 3 components of this system you're inheriting. If possible, set the expectation that the transfer of ownership will take a few months. A few questions to guide the transition:

    • How long can you expect support from the previous owners? What about their in-flight projects?
    • On the new team, can you allocate people to specialize in parts of the system?
    • What runbooks exist already for each component? (you should create them if they don't exist)

    With ownership transitions, what I've found to be more important than immediately understanding the code or making improvements is having a thorough plan that ensures nothing gets dropped. That plan probably consists of a few steps:

    • Ownership transition
    • Handling support requests/maintenance burden (sounds like lots of data scientists already use this infra)
    • Fixing bugs -- create a prioritized list
    • List of improvements -- again, a prioritized list is important here

    If you have these steps outlined and clearly communicated, I think you'll go a long way in building trust with customers/leadership, and the actual timeline on the improvements becomes more manageable.

  • Robinhood, Meta, Course Hero, PayPal
    2 years ago

    "For example, the process for A/B testing new combinations of parameters for the ML models in production is very manual: you have to open 3 PRs in different repos and change some config file entries."

    I faced something very similar back at Instagram; it was literally 3 different changes across 3 separate config files to set up an experiment, haha. It was especially painful because if you did some of them but forgot one, your entire experiment would be out of whack (really painful when you have already run it for 2 weeks and are now realizing that the data is dirty).

    At the end of the day, anything manual can be automated. I'm unsure how much credit you would get given your level, but automating this seems like a meaty win for a junior/mid-level engineer and a decent one for a senior engineer. Here are some ways of going about it, from highest difficulty to lowest:

    1. Have an internal tool that submits all 3 PRs at once - This one is ideal: you just enter the parameters of your A/B test, and it generates the PRs + links them together (a minimal sketch follows this list).
    2. Have automated reminder text on PRs - The idea is that when you submit one, it asks whether you have submitted the others and links to the proper repos (see the second sketch below).
    3. Create an A/B test checklist - I'm unsure what task management software you use, but the idea is that when someone makes a JIRA ticket (or similar) to run an A/B test, it adds a comment reminding the task owner to make the 3 PRs (and links to the proper repos).
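
    Here's a minimal sketch of option 1 in Python, assuming git and an authenticated GitHub CLI (`gh`) are available; the repo names, config paths, and `ab-test/<experiment>` branch convention are all hypothetical stand-ins for whatever your three repos actually use:

```python
#!/usr/bin/env python3
"""Sketch: open all three A/B test PRs in one shot (option 1).

Hypothetical setup: three repos, each with a YAML config that takes a
new entry per experiment. Requires git and an authenticated GitHub CLI.
"""
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical repos and config paths; replace with your real ones.
REPOS = {
    "your-org/pipelines": "configs/experiments.yaml",
    "your-org/training": "configs/experiments.yaml",
    "your-org/serving": "configs/experiments.yaml",
}


def run(cmd, cwd=None):
    """Run a command, echoing it, and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)


def open_pr(repo, config_path, experiment, params, workdir):
    """Clone repo, append the experiment entry, push a branch, open a PR."""
    dest = Path(workdir) / repo.split("/")[1]
    branch = f"ab-test/{experiment}"
    run(["gh", "repo", "clone", repo, str(dest)])
    run(["git", "checkout", "-b", branch], cwd=dest)

    # Append a new entry; a real tool should parse the YAML properly.
    entry = f"\n{experiment}:\n" + "".join(
        f"  {key}: {value}\n" for key, value in params.items()
    )
    with (dest / config_path).open("a") as f:
        f.write(entry)

    run(["git", "commit", "-am", f"Add A/B test {experiment}"], cwd=dest)
    run(["git", "push", "-u", "origin", branch], cwd=dest)
    result = subprocess.run(
        ["gh", "pr", "create", "--fill", "--head", branch],
        cwd=dest, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()  # gh prints the new PR's URL


def main():
    experiment = sys.argv[1]                                # e.g. ranker-lr-sweep
    params = dict(kv.split("=", 1) for kv in sys.argv[2:])  # e.g. lr=0.01
    with tempfile.TemporaryDirectory() as workdir:
        urls = [
            open_pr(repo, path, experiment, params, workdir)
            for repo, path in REPOS.items()
        ]
    # Cross-link the PRs so reviewers see the whole experiment at once.
    for url in urls:
        siblings = "\n".join(u for u in urls if u != url)
        run(["gh", "pr", "comment", url, "--body",
             f"Sibling PRs for this experiment:\n{siblings}"])


if __name__ == "__main__":
    main()
```

    Running something like `python open_ab_prs.py ranker-lr-sweep lr=0.01` (names hypothetical) would then open and cross-link one PR per repo, so an experiment can never be half-configured.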
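
    Option 2 can be a small CI check on each experiment PR; here's a sketch under the same hypothetical repo names and branch convention (the environment variable names are illustrative, your CI system will expose equivalents):

```python
#!/usr/bin/env python3
"""Sketch: CI reminder check for option 2.

Runs on each PR whose branch follows the (hypothetical) ab-test/<name>
convention; posts a reminder if the sibling repos lack a matching PR.
"""
import json
import os
import subprocess

SIBLING_REPOS = ["your-org/pipelines", "your-org/training", "your-org/serving"]

# Illustrative env var names; your CI system exposes equivalents.
current_repo = os.environ["CURRENT_REPO"]  # e.g. "your-org/serving"
branch = os.environ["PR_BRANCH"]           # e.g. "ab-test/ranker-lr-sweep"
pr_url = os.environ["PR_URL"]

if branch.startswith("ab-test/"):
    missing = []
    for repo in SIBLING_REPOS:
        if repo == current_repo:
            continue
        # Ask GitHub whether a PR from the same branch exists in that repo.
        out = subprocess.run(
            ["gh", "pr", "list", "--repo", repo, "--head", branch,
             "--json", "url"],
            check=True, capture_output=True, text=True,
        ).stdout
        if not json.loads(out):
            missing.append(repo)
    if missing:
        body = ("Reminder: this experiment has no matching PR in "
                + ", ".join(missing)
                + ". A/B test configs must change in all three repos.")
        subprocess.run(["gh", "pr", "comment", pr_url, "--body", body],
                       check=True)
```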