Modern Data Pipelines Testing Techniques: Why Bother? 3/3
A Visual Guide.
Book link: Modern Data Pipelines Testing Techniques on leanpub.
Fighting Against The Manual Auto-Pilot
Problem:
The ad hoc spot check is a bad habit. It starts during the exploration phase when you don’t know which tables should be used for this report. A couple of queries here and there. A couple of manual size checks. How many rows did I get this time? Did the data land in the correct folder? I guess I can ignore these pandas warnings?
If nobody depends on your data pipeline and you are the only short term user, the adhoc spot check is fine with me. The problem with this is that commercial data pipelines start as single user apps and some of them explode in popularity. Suddenly, the business finds the results of an analysis useful, and asks you to rerun the analysis for a different geo. Now you start scrambling for which notebook to use for the spot checks, and how to change the geo in the relevant places in the analysis. I am not…