How to write clean code when working with ML scripts/functional programming?

MLE at Taro Community, 7 months ago

In my company (early stage, so there is no formal code standard; this is also R&D code, not yet production code), the ML code for preprocessing the data is a long list of function calls: ETL and pandas code to manipulate, clean, and process the raw data. Then there's obviously a ton of other function calls for the ML itself: split the data, remove missing values, run it through ML models, and call test methods.

This makes it hard to write clean code because:

  1. There is very little code reuse to make stuff modular
  2. In most SWE cases there are increasing layers of abstraction. But here there is no obvious way to abstract away stuff since most of it is a long list of sequential processing code

Things I'm doing

  • Breaking the code into classes (1 for the model, 1 for preprocessing, 1 for tests, etc.). But even these are just becoming long functions or a Python file with 20 functions
  • Trying to make the scripts function-based so that when I make a function call, it is clear what that function is supposed to do

Discussion

(2 comments)
  • Tech Lead @ Robinhood, Meta, Course Hero
    7 months ago

    "There is very little code reuse to make stuff modular"

    Code doesn't have to be reused to become modular. My definition of modular is any code that's split up into reasonably sized, focused components. It doesn't even have to be OOP if your language doesn't support it - Just split things into separate files with good names.

    "In most SWE cases there are increasing layers of abstraction. But here there is no obvious way to abstract away stuff since most of it is a long list of sequential processing code"

    Increasing layers of abstraction isn't a good thing IMHO - It just feels like a good thing because it's generally the sign of a growing, successful company (otherwise, why would your codebase be so big?). But at the end of the day, nobody likes jumping through 8 layers of function calls to understand how something works. Don't feel pressured to add more layers of abstraction to your code - In fact, fight for the opposite.

    All that being said, you can always add a layer of abstraction, even if the code is sequential. Here's a basic example:

    // Pseudocode in Kotlin syntax; Bun and the helper functions are assumed to exist elsewhere.
    fun makeHamburger(bun: Bun) {
      val patty = getGrilledPatty()
      addPatty(bun, patty)
      addCheese(bun)
      addTomato(bun)
      addLettuce(bun)
    }
    

    Making a hamburger is a sequential process, and I introduced 1 layer of abstraction by moving each individual step into its own method.
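
    The same pattern maps onto an ML script. Here's a minimal Python sketch, not anyone's actual pipeline: the step names (drop_missing, add_ratio, split_train_test), the column names, and the use of plain dicts in place of a pandas DataFrame are all assumptions made so the example runs on its own.

```python
import random


def drop_missing(rows):
    """Keep only rows where every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_ratio(rows):
    """Add a derived feature without mutating the input rows."""
    return [{**r, "ratio": r["clicks"] / r["views"]} for r in rows]


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically, then cut into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def prepare_data(raw_rows):
    """The top-level function reads like a table of contents for the script."""
    rows = drop_missing(raw_rows)
    rows = add_ratio(rows)
    return split_train_test(rows)


# Hypothetical raw data standing in for a DataFrame.
raw = [
    {"clicks": 1, "views": 10},
    {"clicks": None, "views": 5},
    {"clicks": 3, "views": 6},
    {"clicks": 2, "views": 8},
    {"clicks": 4, "views": 4},
]
train, test = prepare_data(raw)
```

    Nothing here is reused anywhere else, but each step has a name and can be read, changed, or tested by itself, and prepare_data stays only one layer deep.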

    Anyways, as I mention in my "Code Quality Isn't Static" lesson for my code quality course, the level of thoughtfulness you put into your code depends on the stakes. For startups, the stakes are relatively low, as you're probably the only engineer working on this and the number of end users is in the thousands, maybe even just the hundreds. As long as the core functionality works and the code is relatively clean via basic clean code tactics (like the ones you're applying, nice job!), that's good enough.

    A lot of code quality is playing things by ear, especially in startups.

    • New engineer joins your team and can't make heads or tails of your code? Okay, maybe you can refactor it so it's easier to understand.
    • The userbase behind your product jumps 100x and now they're finding new edge cases left and right? Okay, let's document all these issues, find patterns, and see how we can reinforce the code so it stops falling into these common bug patterns (maybe add some unit tests).
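
    For that second case, a unit test can pin down an edge case once you understand it. A minimal sketch, assuming a hypothetical fill_missing_age preprocessing step (not from the original question) and plain assert-style test functions of the kind pytest would pick up:

```python
from statistics import median


def fill_missing_age(rows):
    """Replace missing ages with the median of the known ages.

    Hypothetical preprocessing step, used only for illustration.
    """
    known = [r["age"] for r in rows if r["age"] is not None]
    if not known:
        # Edge case: nothing to compute a median from, so return copies unchanged.
        return [dict(r) for r in rows]
    fill = median(known)
    return [{**r, "age": fill if r["age"] is None else r["age"]} for r in rows]


def test_fills_with_median():
    rows = [{"age": 20}, {"age": None}, {"age": 40}]
    assert fill_missing_age(rows)[1]["age"] == 30


def test_all_missing_is_left_untouched():
    rows = [{"age": None}, {"age": None}]
    assert fill_missing_age(rows) == rows


def test_empty_input():
    assert fill_missing_age([]) == []
```

    Each test documents one bug pattern, so the next refactor can't quietly bring it back.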
  • Fractional CTO, Board Advisor, & VC Tech Advisor
    7 months ago

    One of the important things for ML workflows in particular is how you structure the entire pipeline. That isn't to say it won't still be a mess at some level, but it'll be better organized than most. I usually just go with my gut on what the structure should look like, check with the team to make sure it's OK, and then just put up the PR. It usually doesn't take too long if handled correctly.
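
    One way to make that structure explicit, as a rough sketch rather than a recommendation of any particular framework, is to declare the stages in one place and run them in order. The run_pipeline helper, the stage names, and the toy data below are all hypothetical.

```python
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(data: Any, stages: list[Stage], log=print) -> Any:
    """Feed the output of each stage into the next, logging progress."""
    for name, step in stages:
        data = step(data)
        log(f"finished stage: {name}")
    return data


# Hypothetical stages; in a real project each might live in its own module.
def load(_source):
    # Stand-in for reading a file: ignores the source and returns toy values.
    return [3.0, None, 1.0, 2.0]


def clean(values):
    return [v for v in values if v is not None]


def normalize(values):
    top = max(values)
    return [v / top for v in values]


STAGES = [("load", load), ("clean", clean), ("normalize", normalize)]

result = run_pipeline("raw.csv", STAGES)
```

    The STAGES list is the part the team can review at a glance; adding, removing, or reordering a stage is a one-line change.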