My team has been handed ownership of a very large, complex product that is critical to the company. It's an ML-based product with three main components:
Each part feels like a sea of knowledge, and I'm wondering how I can get a holistic understanding of how everything works. There is also a lot of room for improvement. For example, the process for A/B testing new combinations of parameters for the ML models in production is very manual (you have to open 3 PRs in different repos just to change some config file entries). I'm wondering what the best approach to improving this is, since a lot of data scientists depend on it.
The first thing I'd do is define more clearly who owns the knowledge around each part of this massive ML product. IMO, it'd be unreasonable to expect a single person to quickly understand and debug all 3 components of the system you're inheriting. If possible, set the expectation that the transfer of ownership will take a few months. A few questions to guide the transition:
With ownership transitions, I've found that having a thorough plan to ensure nothing gets dropped is more important than understanding the code or making improvements right away. I think this probably consists of a few steps:
If you have these steps outlined and clearly communicated, I think you'll go a long way toward building trust with customers and leadership, and the actual timeline for the improvements becomes more manageable.
"for example, the process for AB testing new combination of parameters for the ML models in production is a very manual thing (you have to open 3 PRs in different repos and just change some config files entries)"
I faced something very similar back at Instagram; it was literally 3 different changes across 3 separate config files to set up an experiment, haha. It was especially painful because if you made some of the changes but forgot one, your entire experiment would be out of whack (really painful when you have already run it for 2 weeks and only then realize that the data is dirty).
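To make that failure mode concrete, here is a minimal sketch of a guard that would catch a forgotten change before the experiment starts. It assumes, hypothetically, that each of the three configs can be loaded as a dict with an `experiments` section; the `Experiment` fields, the function names, and the repo names are all made up for illustration and are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Single source of truth for one A/B test (all field names are hypothetical)."""
    name: str
    params: dict
    traffic_percent: int


def expected_entry(exp: Experiment) -> dict:
    """The entry every config is assumed to need for this experiment."""
    return {"params": dict(exp.params), "traffic_percent": exp.traffic_percent}


def apply_experiment(config: dict, exp: Experiment) -> dict:
    """Return a copy of one repo's config with the experiment entry added or replaced."""
    updated = dict(config)
    experiments = dict(updated.get("experiments", {}))
    experiments[exp.name] = expected_entry(exp)
    updated["experiments"] = experiments
    return updated


def find_inconsistencies(configs: dict, exp: Experiment) -> list:
    """Return readable problems; an empty list means every config matches the spec."""
    expected = expected_entry(exp)
    problems = []
    for repo, config in configs.items():
        entry = config.get("experiments", {}).get(exp.name)
        if entry is None:
            problems.append(f"{repo}: experiment '{exp.name}' is missing")
        elif entry != expected:
            problems.append(f"{repo}: experiment '{exp.name}' differs from the spec")
    return problems


# Simulate the "did two of the three changes" mistake with placeholder repo names.
exp = Experiment("ranker_lr_sweep", {"learning_rate": 0.05}, 10)
configs = {"repo_a": {}, "repo_b": {}, "repo_c": {}}
configs["repo_a"] = apply_experiment(configs["repo_a"], exp)
configs["repo_b"] = apply_experiment(configs["repo_b"], exp)
print(find_inconsistencies(configs, exp))
```

Running a check like this in CI, or as a pre-launch step, turns a two-week-old dirty-data surprise into an immediate error. The same `Experiment` spec could also be what generates the three config changes in the first place, so nobody edits them by hand.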
At the end of the day, anything manual can be automated. I'm unsure how much credit you would get without knowing your level, but automating this seems like a meaty win for a junior/mid-level engineer and a decent one for a senior engineer. Here are some ways of going about it, going from highest difficulty to lowest: