Need advice on writing clean code for ETL pipelines

Mid-Level SWE at Taro Community · 3 months ago

A lot of my work depends on writing pandas/ETL pipelines in Python.

Pain point #1: I find that my functions (each logic block is a function) tend to be hardcoded. How do I avoid this?

Example:

def agg(df):
    # 'function' is a placeholder for the real aggregation
    df['A'] = df.groupby('C')['B'].transform(function)
    # 10 more lines of specific business logic
    return df

def remove_duplicates(df):
    # Gets the set of existing IDs from the database and removes those rows from this ETL pipeline
    existing_rows = get_existing_ids_from_db()
    df = df[~df['id'].isin(existing_rows)]
    return df

def get_existing_ids_from_db():
    # .all() returns rows, so unpack the id from each row
    return {row.id for row in db.query(Data.id).all()}

As you can see, the issue is that a lot of these functions tend to be hardcoded. Thus it's hard to:

  1. Isolate and verify
  2. Extend
  3. Reuse

But when I make it less hardcoded, it ends up being more redundant and thus hard to maintain: if I change the logic in one place, I need to change it everywhere else (e.g., "remove all duplicates except this super admin" applies to pipelines A, B, and C but not D).

This is a simplified example, but it illustrates a larger point: my gut instinct is to write a pipeline, then isolate the logic into functions, and then I realize that my logic is all hardcoded when I want to reuse it with slightly different columns/logic.
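To show what I mean (the parameter names below are illustrative, not my actual code), the "less hardcoded" version of remove_duplicates ends up looking something like this, and now every pipeline has to remember to pass the right arguments:

def remove_duplicates(df, id_col='id', keep_ids=frozenset()):
    # Same dedup logic as above, but the ID column and the exception list are parameters now
    existing_rows = get_existing_ids_from_db() - set(keep_ids)
    return df[~df[id_col].isin(existing_rows)]

# Pipelines A, B, C keep the super admin: remove_duplicates(df, keep_ids={SUPER_ADMIN_ID})
# Pipeline D just does plain dedup: remove_duplicates(df)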

Pain point #2: when I pass DataFrames into a function, the code becomes hard to maintain because the function now expects a very specific DataFrame with a specific set of columns. How should I be thinking about designing functions here?
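To make the coupling concrete (this snippet is made up, not from my real pipelines), a function like this only works if the caller hands it a DataFrame with exactly these columns, and nothing in the signature says so:

import pandas as pd

def add_revenue_share(df):
    # Silently assumes the caller's DataFrame has 'region' and 'revenue' columns
    df = df.copy()
    df['region_total'] = df.groupby('region')['revenue'].transform('sum')
    df['revenue_share'] = df['revenue'] / df['region_total']
    return df

# Works for one pipeline's DataFrame...
add_revenue_share(pd.DataFrame({'region': ['x', 'x'], 'revenue': [1.0, 3.0]}))
# ...but a pipeline whose column is named 'sales' instead of 'revenue' fails with a KeyError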

Additional ask: any suggestions on resources for writing good code for ETL/data engineering?


Discussion (1 comment)
    Tech Lead @ Robinhood, Meta, Course Hero · 3 months ago

    First, there's nothing wrong with copy-paste. I feel like a rite of passage for most engineers is realizing that seeing the same code twice (or more) isn't the end of the world. The problem with abstracting away logic into common areas is as you mentioned:

    1. Use-case 1 wants to modify A, B, C, and D
    2. Use-case 2 wants to modify just A, B, and C
    3. If the 2 use-cases share some method like processPipelines() that touches A, B, C, and D, you break Use-case 2
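    As a rough sketch of that trap (the step functions are placeholders, not your real code):

        import pandas as pd

        def step_a(df): return df.assign(a=1)
        def step_b(df): return df.assign(b=2)
        def step_c(df): return df.assign(c=3)
        def step_d(df): return df.assign(d=4)

        def processPipelines(df):
            # Shared helper that always runs every step
            return df.pipe(step_a).pipe(step_b).pipe(step_c).pipe(step_d)

        # Use-case 2 gets the D step it never asked for, or processPipelines()
        # grows flags like skip_d=True until it's a pile of if-statements.
        processPipelines(pd.DataFrame({'id': [1, 2]}))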

    When moving code into common places, it's a spectrum of granularity:

    1. Most granular - Moving tiny pieces like hard-coded strings into constants
    2. Medium granularity - Taking specific slices of business logic and putting them into methods like processPipelineA()
    3. Low granularity - Moving almost all business logic into a common class/method like processAllPipelines()

    Figure out what level you can take on that balances reusability and scalability (without breaking things and needing to write a bunch of if-statements).
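    For the medium-granularity option in pandas, one shape that tends to work (a sketch with made-up step and variable names, not a drop-in solution) is to keep each slice of business logic as its own small function and let each pipeline compose only the steps it needs:

        def add_group_totals(df):
            # One pipeline-specific slice of business logic
            df = df.copy()
            df['group_total'] = df.groupby('group_id')['value'].transform('sum')
            return df

        def drop_existing_ids(df, existing_ids, keep_ids=frozenset()):
            return df[~df['id'].isin(set(existing_ids) - set(keep_ids))]

        # Pipeline A: dedup with a super-admin exception, then aggregates
        # result_a = raw_df.pipe(drop_existing_ids, existing_ids, keep_ids={SUPER_ADMIN_ID}).pipe(add_group_totals)

        # Pipeline D: plain dedup only, so it simply skips the steps it doesn't need
        # result_d = raw_df.pipe(drop_existing_ids, existing_ids)

    Each pipeline stays explicit about what it runs, and changing one step doesn't silently change pipelines that never call it.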

    Also, you will often need to write specific business logic.

    I don't have the full context of your code and I'm not an ETL specialist, so I made guesses where I could. Hopefully at least some of this made sense.

    Lastly, AI can be helpful with this if given enough context. It will rarely give you a great answer off the bat, but it's useful for bouncing ideas off of and then molding whatever inspiration it gives you.