How to do Less Data Validation (Avoid Perfectionism)

Data Engineer at Financial Company, a year ago

I find that much/most of my time spent working is on data validation. Not only is this unfun and unglamorous, but it's also generally unnecessary. Sometimes I find a real issue in the data, but usually I don't.

I think it's better for me to do less data validation for a few reasons:

1) often the "issues" that I think exist turn out not to be issues
2) the issues can often be fixed later (if anyone notices them at all)
3) the issues don't necessarily need to be addressed by me - in fact I'd rather they be addressed by others

I think my desire for the data to be right comes from both perfectionism (getting it right the first time) and a misplaced desire to do "good" work.

I'll give a couple examples to illustrate my struggle:

1) When I'm creating an ingestion for my production database, I want the ingestion to go smoothly, particularly because some of our production tables are audited as they deal with financial data. But our dev and uat environments contain garbage data compared to prod, so it's hard for me to be certain that my pipeline works using them. I can and often do try to get dev and uat to mirror prod as much as possible, but it takes work and effort.
2) I went through the process of replacing a file that comes from one source today with the same file from another source that my team controls. Because this file is important, we want to make sure that the data in it is the same as in the file we were replacing. The problem comes from the fact that the farther you go back in time, the more (valid) reasons there can be for the files being different.
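
To make the second example concrete: one way to bound the comparison is to diff only a recent window, where the two files should agree, and to ignore older history. This is just a minimal sketch, not how my team actually does it. The column names (`id`, `as_of`, `amount`), the cutoff date, and the sample data are all made up for illustration.

```python
import csv
import io
from datetime import date


def load_rows(csv_text, key, date_field, cutoff):
    """Parse CSV text and keep only rows dated on or after the cutoff, indexed by key."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if date.fromisoformat(row[date_field]) >= cutoff:
            rows[row[key]] = row
    return rows


def compare_recent(old_csv, new_csv, key="id", date_field="as_of", cutoff=date(2024, 1, 1)):
    """Summarize differences between two extracts within the recent window only."""
    old = load_rows(old_csv, key, date_field, cutoff)
    new = load_rows(new_csv, key, date_field, cutoff)
    shared = old.keys() & new.keys()
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "missing_in_old": sorted(new.keys() - old.keys()),
        "mismatched": sorted(k for k in shared if old[k] != new[k]),
    }


# Hypothetical extracts: row 1 differs but predates the cutoff, so it is ignored.
OLD = """id,as_of,amount
1,2023-06-30,100
2,2024-02-01,250
3,2024-03-01,75
"""
NEW = """id,as_of,amount
1,2023-06-30,999
2,2024-02-01,250
3,2024-03-01,80
4,2024-03-02,10
"""

report = compare_recent(OLD, NEW)
print(report)
```

The report only lists keys, so the output stays short enough to hand to whoever owns the source instead of investigating every row myself.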

All this to say, I really think it pays to be lazier regarding data quality because certainty is so expensive. I also don't think people care for the most part and if they do care, it's usually fixable later on.

OK, rant over. If anyone has tips for helping me do less data quality, I'd appreciate it.

Discussion

(3 comments)

Rahul Pandey
•Tech Lead/Manager at Meta, Pinterest, Kosei
a year ago
I've only dealt with a few data pipelines in my career, but some things to consider:

Be clear about the cost of 'screwing up.' If the cost of a mistake is not high, you can move a lot faster without needing to triple-check everything. Write that down and communicate the remediation plan to people so it's expected.

If there is a mistake, how quickly can you identify the right person/team who should take a look?

Are there tools you can use to provide extra confidence? Ideally, you want some sort of checklist that you can go through, which is a bit more sophisticated than 'doing an eyeball check.' It could be as simple as a linter or running it on the dev environment.
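
As a rough illustration of what such a checklist could look like in code, here is a minimal sketch. The check names, the `id` and `amount` fields, and the sample rows are all hypothetical; a real pipeline would swap in whatever checks actually matter for its tables.

```python
def check_not_empty(rows):
    """The extract should contain at least one row."""
    return len(rows) > 0


def check_no_null_amounts(rows):
    """Every row should carry a non-null amount."""
    return all(r.get("amount") is not None for r in rows)


def check_unique_ids(rows):
    """No id should appear more than once."""
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))


# The checklist is plain data, so adding or dropping a check is a one-line change.
CHECKLIST = [
    ("has rows", check_not_empty),
    ("no null amounts", check_no_null_amounts),
    ("unique ids", check_unique_ids),
]


def run_checklist(rows, checklist=CHECKLIST):
    """Run every check and return (name, passed) pairs instead of stopping at the first failure."""
    return [(name, bool(check(rows))) for name, check in checklist]


# Hypothetical sample with one null amount and one duplicated id.
sample = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 50},
]

results = run_checklist(sample)
for name, passed in results:
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

The point is less the specific checks and more that the list is fixed and finite: once everything on it passes, you can stop validating.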

"But our dev and uat environments"

Just curious, what is a uat environment?
- Data Engineer [OP]
  •Financial Company
  a year ago
  I mean, uat stands for "user acceptance testing", but you probably already know that. I'm under the impression that it's common to have 3 environments normally: a production one, a staging/testing/uat one, and a development one. In my previous company where I was a backend dev at a startup, there was actually a 4th env as people would copy the dev database to their local env and work there.
  
  Is a 3-env setup not standard practice?
Thoughtful Tarodactyl
•Taro Community
a year ago
I don't have much to add, but I want to say I have been in the same position of unnecessarily testing data infra, and it was one of the most painful jobs I did. I quit that job and I'm much happier at my new place.

I realized I was not senior enough to change the team's processes, and the scope of the work was terrible, so moving out of the job was the best option.