Modern Data Pipelines Testing Techniques: Why Bother? 2/3
A Visual Guide.
Book link: Modern Data Pipelines Testing Techniques on leanpub.
Objections to TDD for Data Work
Problem:
When a new data dev catches the TDD bug and applies TDD to their data pipelines, they report the following objections:
- Data pipelines give non-deterministic results over time
- Creating test data is costly and discouraging
- Recreating the production environment is an expensive nightmare
These objections lead data devs to cut corners. They lower their test coverage until it reaches a workable tradeoff. On one side of the tradeoff sits the effort needed to set up test environments. On the other side, the confidence level that their business use case requires.
The core issue is that the business wants the data to be reliable, fresh, and actionable, yet inexpensive to generate and maintain. This tradeoff is central to the objections you will raise to yourself the next time you drift away from TDD on a data project. Unfortunately, new data devs do not have suitable mental models to explore this tradeoff. What kind of tests do I need? Which testing components can I skip, given the business importance of the data? How do I set up the bare minimum testing infrastructure required to reach the data reliability the business needs?
Complications:
Technical complexity gets layered on top of the classic cost vs. reliability tradeoff. When adopting TDD+CICD, data devs face too many decisions at once. How do I get started? What tools, infrastructure, and platform do I need to learn to do this right? How long will it delay my initial release?
Guidance:
As usual, the first step is raising awareness: having someone tell you that it is normal, that all new data devs feel unsure about applying TDD to data pipelines.
The advice here is to learn the skills required for TDD, outside the critical path of your current business process. Take a less important, low-intensity data pipeline and experiment with the concepts in this book. Rewriting an existing data pipeline with TDD+CICD in a parallel data environment is a great starting point. It enables the data dev to learn about the technical practices in a known business domain.
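As a concrete starting point, here is a minimal sketch of what test-first can look like for a single transform in that parallel rewrite, assuming a pandas-based pipeline and pytest as the test runner. The dedupe_daily_orders transform and its columns are hypothetical, invented purely for illustration.

```python
# Hypothetical test-first example: the transform and its expected behavior
# are illustrative assumptions, not code from this book.
import pandas as pd


def dedupe_daily_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep only the latest record per order_id, based on updated_at."""
    return (
        raw.sort_values("updated_at")
        .drop_duplicates(subset="order_id", keep="last")
        .reset_index(drop=True)
    )


def test_dedupe_keeps_latest_record_per_order():
    raw = pd.DataFrame(
        {
            "order_id": [1, 1, 2],
            "updated_at": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-01"]),
            "status": ["created", "shipped", "created"],
        }
    )

    result = dedupe_daily_orders(raw)

    # Exactly one row per order, and order 1 keeps its most recent status.
    assert sorted(result["order_id"]) == [1, 2]
    assert result.loc[result["order_id"] == 1, "status"].iloc[0] == "shipped"
```

Writing the test first forces you to pin down what "deduplicated" means before touching the transform, which is exactly the habit the parallel rewrite is meant to build.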
Sources of Data Validation Complexity
Problem:
I would bet that many data devs share this view: the inherent lack of control over the input data is the most annoying part of any data pipeline. The fact that a perfectly fine data transformation works on day D1, breaks on D2, and works again on D3 can be infuriating. Data pipelines routinely see this happen when a single day of data is bad because of upstream issues, while the rest of the days are fine because the upstream team fixed the issue and things are well again. Sadly, new devs will not protect the data pipeline from this ahead of time. They have never seen it happen, so why bother?
Complications:
In addition to the lack of control over the input, the validation complexity is highly variable. It is one thing to fix a bug with known good data on a codebase we are familiar with. It is another thing to deal with missing data, especially when you do not even know that it is missing. As a result, detecting data completeness relies on heuristics that are disheartening for new data devs. These devs treat it as the perfect excuse: "nobody told me that the distinct values of this field would change."
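To make the heuristics concrete, here is a minimal sketch of a volume-based completeness check, assuming you can query daily row counts for a partition. The 50% tolerance and the sample numbers are arbitrary assumptions for illustration, not recommendations.

```python
# Minimal sketch of a completeness heuristic: flag a partition whose row
# count drops far below the recent average. Thresholds and numbers below
# are illustrative assumptions only.
from statistics import mean


def looks_complete(daily_row_counts: list[int], today_count: int,
                   tolerance: float = 0.5) -> bool:
    """Return False when today's volume is suspiciously low vs. recent history."""
    if not daily_row_counts:
        return True  # no history yet, nothing to compare against
    baseline = mean(daily_row_counts)
    return today_count >= tolerance * baseline


# Example: the last 7 days averaged ~100k rows, but today only 20k arrived.
history = [98_000, 101_000, 99_500, 103_000, 100_200, 97_800, 102_500]
assert looks_complete(history, today_count=99_000)
assert not looks_complete(history, today_count=20_000)
```

It is a heuristic, not a proof of completeness, but it turns "we did not know the data was missing" into an alert you see before your consumers do.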
But the source data is only one of many culprits. We could also have misunderstood the expected output of our own data transforms. Unexpected outputs are especially common for modeling apps trained on biased datasets, which then expose that bias in their output predictions.
Guidance:
The guidance here is to build as good an understanding of the source data as possible. Do this by talking to the producers and collaborating on data quality contracts. By the same token, the data pipeline under test also has consumers, so building data quality contracts with them is recommended as well. This proactive behavior will positively surprise your data consumers, because so few data dev teams invest in data contracts given the constraints they impose. But longer term, you will find that building agreement between your producers and your team, and between your team and your consumers, benefits everybody.
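As an illustration, a data quality contract can start as a small, code-reviewed check shared with the producer. The sketch below assumes a pandas DataFrame and hypothetical column names and rules; the point is the shape of the agreement, not these specific constraints.

```python
# Minimal sketch of a data quality contract expressed as code.
# Column names and rules are illustrative assumptions agreed with the producer.
import pandas as pd

CONTRACT = {
    "required_columns": {"order_id", "customer_id", "order_total", "country"},
    "non_nullable": {"order_id", "customer_id"},
    "allowed_values": {"country": {"US", "CA", "MX"}},
}


def violations(df: pd.DataFrame, contract: dict) -> list[str]:
    """Return human-readable contract violations (an empty list means OK)."""
    problems = []
    missing = contract["required_columns"] - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    for col in contract["non_nullable"] & set(df.columns):
        if df[col].isna().any():
            problems.append(f"nulls found in non-nullable column: {col}")
    for col, allowed in contract["allowed_values"].items():
        if col in df.columns:
            unexpected = set(df[col].dropna()) - allowed
            if unexpected:
                problems.append(f"unexpected values in {col}: {sorted(unexpected)}")
    return problems
```

Running the same check on both sides of the handoff turns a verbal agreement into something both teams can watch fail, and it gives the "distinct values of this field changed" conversation a place to happen before production.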
The Data Product Promise No One Can Keep
Problem:
“After almost getting fired because of the previous incident with the exec revenue dashboarding pipeline, I will only work on testable data pipelines from now on,” has been said and will be said many times. When new data devs encounter their first data bug in production, they decide to push back on any new data product requirements. No more work until they get full assurance about the data sources and the required data outputs. “No more dealing with flaky data sources that we have no control over.” “No more assuming that we understand what the customer wants.” These are early signs that the data devs are fed up with dealing with so much ambiguity.
Complications:
The reality is that commercial interests will take priority over code quality. Besides, platform changes will always happen, and technical deficiencies will always exist. There is no escape from constant change in the software world. As a result, optimizing our work for learning about the business domain is key. Building delivery mechanisms with fast feedback is critical. Using repeatable engineering practices is the best we can do at this point (the year is 2023). Advances in general artificial intelligence and quantum computing might change that in the future. But for the time being, data engineering is still an engineering profession, and engineering always deals with trade-offs.
Guidance:
The best practices known to humans today for dealing with “not-fully-testable” environments are summarized in the DORA research about effective software teams [3]. Testing is at the core of much of the automation prescribed by the DORA research. We can release millions of times per day, but how do we know if that code works? How do we reduce the lead time between requirements and delivery if the system is a tangled mess of untestable code? How do we quickly recover from data downtime if we cannot test our fixes quickly and reliably? How do we reduce our change failure rate if one fix in one part of the pipeline leads to another failure in an unrelated part of the data pipeline?
The most efficient way we know of to build data pipelines is through disciplined automated testing. This is quite a tall order. Testing data pipelines is at the edge of human knowledge, so only a few new data devs have all the skills necessary to tackle this challenge.
Therefore, the majority ends up giving up on automated testing. Don't join their ranks. Starting small and growing your testing skills iteratively over time is the best guidance I can give for working in a "not-fully-testable" data environment. Giving up and reverting to manual testing will only make you bitter: bitter toward your data sources, your pipeline code, and your data outputs. Manually validating the quality of your data pipelines will burn you out. Having to explain and apologize to your stakeholders for the most recent data downtime gets old quickly.
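If you are wondering what "starting small" can look like in practice, a single freshness check is often enough for a first iteration. This sketch assumes a timezone-aware UTC timestamp for the newest loaded partition and an illustrative 24-hour SLA; both are assumptions, not prescriptions from this book.

```python
# Minimal "start small" example: one freshness check you can bolt onto an
# existing pipeline today. The SLA value is an illustrative assumption.
from datetime import datetime, timedelta, timezone


def assert_fresh(latest_loaded_at: datetime, max_age_hours: int = 24) -> None:
    """Fail loudly when the newest partition is older than the agreed SLA."""
    # latest_loaded_at must be timezone-aware (UTC) for this subtraction.
    age = datetime.now(timezone.utc) - latest_loaded_at
    if age > timedelta(hours=max_age_hours):
        raise AssertionError(
            f"Data is stale: newest partition is {age} old (SLA is {max_age_hours}h)."
        )
```

One small check like this, run on a schedule, is a far better first step than a grand plan for full coverage that never ships.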
Want to learn more about modern data pipelines testing techniques?
Check out my latest book on the subject. It is a visual guide to the most popular techniques for testing modern data pipelines.
2023 Book link:
Modern Data Pipelines Testing Techniques on leanpub.
See ya!
Disclaimer: The views expressed in this post are mine and do not necessarily reflect the views of my current or past employers.