A Visual Guide.
Book link: Modern Data Pipelines Testing Techniques on Leanpub.
Fighting Against The Manual Auto-Pilot
The ad hoc spot check is a bad habit. It starts during the exploration phase, when you don’t yet know which tables to use for a report. A couple of queries here and there. A couple of manual size checks. How many rows did I get this time? Did the data land in the correct folder? I guess I can ignore these pandas warnings?
If nobody depends on your data pipeline and you are the only short-term user, the ad hoc spot check is fine with me. The problem is that commercial data pipelines start as single-user apps, and some of them explode in popularity. Suddenly, the business finds the results of an analysis useful and asks you to rerun it for a different geo. Now you are scrambling to work out which notebook to use for the spot checks and how to change the geo in the relevant places in the analysis. I am not saying that this happens every time. But even during your exploration activities, you might want to invest a little bit of time in automating the testing of your data pipeline.
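As a minimal sketch of what that small investment could look like, the manual spot checks (a row count, a per-geo result) can be turned into assertions that run against a tiny in-memory sample. Everything here is hypothetical and only for illustration: the function names `read_rows` and `revenue_by_geo`, the column names `geo` and `amount`, and the sample values are assumptions, not taken from any real pipeline.

```python
import csv
import io
from typing import Iterable, List


def read_rows(handle) -> List[dict]:
    """Hypothetical I/O boundary: parse CSV text from any file-like object."""
    return list(csv.DictReader(handle))


def revenue_by_geo(rows: Iterable[dict], geo: str) -> float:
    """Hypothetical domain logic: sum 'amount' for rows matching the geo.

    The geo is a parameter, so rerunning for another geo means changing
    one argument instead of editing several places in a notebook.
    """
    return sum(float(row["amount"]) for row in rows if row["geo"] == geo)


def test_revenue_by_geo():
    # A three-row in-memory sample stands in for the full dataset.
    sample = io.StringIO("geo,amount\nUS,10.5\nDE,4\nUS,2.5\n")
    rows = read_rows(sample)

    # The "how many rows did I get this time?" check, automated.
    assert len(rows) == 3

    # The per-geo result, checked for a geo that is present and one that is not.
    assert revenue_by_geo(rows, "US") == 13.0
    assert revenue_by_geo(rows, "FR") == 0


test_revenue_by_geo()
```

Because the calculation never opens a file itself, the same test runs in milliseconds on a handful of rows, and a test runner such as pytest could pick up `test_revenue_by_geo` without the explicit call on the last line.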
In addition to the business interest, your time is valuable. When you develop a data pipeline, an ad hoc spot check of the outputs forces you to rerun the full pipeline every time. A full run is slow at every stage: ingesting the whole data set, running all the transformations, and writing out the full resulting dataset all devour your time. This process is arduous, and without test automation you might need to repeat it many times. Checking that the full pipeline completes after a small code change in step 4 of 6 should not be the default. Splitting the pipeline into meaningful chunks that separate the outside world from the domain logic would be nice. And wouldn’t it be nice to test out the…