Modern Data Pipelines Testing Techniques: Why Bother? 1/3

Moussa Taifi PhD
8 min read · Jan 2, 2023

A Visual Guide.

Book link: Modern Data Pipelines Testing Techniques on Leanpub: https://leanpub.com/moderndatapipelinestestingtechniques/

Part 1/3
Part 2/3
Part 3/3

Data Pipeline Transitive Failure Modes: The Reality Check

Diagram: Data Pipeline Transitive Failure Modes

Problem:

New data devs do not realize that data pipelines have transitive failure modes. These failure modes are not clear to the untrained eye. After the first few months of fighting debugging battles against things like:

  • “Is this number correct?”
  • “Why is the dashboard not updated for yesterday’s data?”
  • “How come we didn’t catch this bug upstream?”

They finally realize that there are many types of failures that were not covered in the 6-week bootcamp that got them the job. On top of that, new data devs find that data producers and consumers are continuously dealing with business change. Unfortunately, these new data devs are ill-equipped to deal with change, like we all were when we started our careers in this data thing.

Complications:

The first thing new devs usually miss is a global view of the “data pipeline.” They are assigned an easy-enough task: add a new aggregate value to a mid-level table. Unfortunately, that task requires them to rename a field to reflect the nature of the new aggregation. Renaming that field breaks the downstream data consumers. This transitive failure is the first of many silent new bugs they will need to painfully revert and backfill.
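To make that transitive break concrete, here is a minimal sketch in Python/pandas. The function and column names are made up for illustration; the point is that the “easy” rename runs fine on its own, and only the downstream consumer blows up.

```python
import pandas as pd

# Upstream job: the "easy" change renames a column as part of the new aggregation.
# (Column and function names here are made up for illustration.)
def build_daily_revenue(events: pd.DataFrame) -> pd.DataFrame:
    out = events.groupby("day", as_index=False)["amount"].sum()
    # Before the change this column was called "revenue"; now it is "gross_revenue".
    return out.rename(columns={"amount": "gross_revenue"})

# Downstream consumer written against the old schema.
def build_dashboard_feed(daily_revenue: pd.DataFrame) -> pd.DataFrame:
    # KeyError: 'revenue' -- the rename breaks this consumer transitively,
    # and nobody finds out until the dashboard job runs.
    return daily_revenue[["day", "revenue"]]

events = pd.DataFrame({"day": ["2023-01-01", "2023-01-01"], "amount": [10.0, 5.0]})
build_dashboard_feed(build_daily_revenue(events))  # raises KeyError at runtime
```

Nothing in the upstream change itself hints at the breakage; only a shared schema contract or a test that exercises the consumer would catch it before deployment.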

Guidance:

Removing this lack of visibility is quite a challenge in large organizations. Still, recognizing the types of transitive data failures should help. The top learned-in-the-field types are:

  • Silent Data Bug
  • Code Bug
  • Late Data Runs
  • Stale Dashboards and Data APIs

The Data Pipeline Transitive Failure Modes diagram, adapted from [1], shows a data bug in the source system. This data bug trickles down to all the related downstream datasets without firing a single alarm. Unfortunately, undetected data bugs are among the hardest failures to catch without solid data quality gates.
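As a sketch of such a gate, here is a minimal hand-rolled check in Python/pandas. The column names, thresholds, and input path are hypothetical; dedicated tools such as Great Expectations or dbt tests play the same role in practice.

```python
import pandas as pd

def quality_gate(df: pd.DataFrame) -> None:
    """Fail loudly instead of letting a silent data bug trickle downstream.
    The checks and thresholds below are illustrative, not recommendations."""
    assert len(df) > 0, "source extract is empty"
    null_ratio = df["user_id"].isna().mean()
    assert null_ratio < 0.01, f"too many null user_ids: {null_ratio:.2%}"
    assert (df["amount"] >= 0).all(), "negative amounts in a revenue feed"

# Run the gate right after ingesting the source data,
# before any downstream job reads it.
raw = pd.read_parquet("raw/orders/2023-01-01/")  # path is hypothetical
quality_gate(raw)
```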

Also, we see a code bug in one of the intermediate steps that made a whole job fail. This job is a dependency of two other jobs downstream. The job workflow scheduler detected this missing dependency and did not schedule the two downstream jobs. With adequate monitoring, this type of error will alert the data dev.
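Here is what that scheduler behavior looks like in a minimal sketch, assuming a recent Apache Airflow 2.x as the workflow scheduler. The DAG name, task names, and the alert callback are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Wire this up to Slack, PagerDuty, etc. so the data dev hears about the
    # failure before the stakeholder does.
    print(f"Task {context['task_instance'].task_id} failed")

def transform():
    raise RuntimeError("code bug in the intermediate step")

with DAG(
    dag_id="revenue_pipeline",  # name is illustrative
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"on_failure_callback": notify_on_failure},
) as dag:
    t = PythonOperator(task_id="transform", python_callable=transform)
    a = PythonOperator(task_id="load_table_a", python_callable=lambda: None)
    b = PythonOperator(task_id="load_table_b", python_callable=lambda: None)

    # With the default trigger rule ("all_success"), a failure in `transform`
    # leaves both downstream loads unscheduled -- the missing-dependency
    # behavior described above.
    t >> [a, b]
```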

Sadly, the alert usually comes from the stakeholder directly! “Why is the dashboard not updated for yesterday’s data?” is the usual complaint. As a result, the data dev loses some credibility with the stakeholder. Then our data dev spends many hours fixing the bug and re-running all the jobs needed to catch up to the latest hours of available data.

Having a visual of the data pipeline’s transitive failure modes helps place the current task in the grand scheme. Furthermore, being aware of the types of data pipeline failures will motivate new data devs to inject quality into their work. The data devs will hopefully see the value of instrumentation and testing to avoid these issues in the future.

Bad Data Devs Lifestyle

Diagram: Bad Data Devs Lifestyle

Problem:

“Time for some Email/Instagram/Twitter/Youtube/Reddit” is the typical reaction to waiting for a long data pipeline to run to completion. “Why wait and babysit this pipeline when I could be doing something else,” said every new data dev ever.

Sadly, this is the state of affairs for most new data devs, and it stems from the lack of modern, humane, and productive techniques for developing data pipelines.

The Bad Data Devs Lifestyle diagram shows how your average data dev spends time working on a data pipeline. This includes coding, waiting for feedback, and deciphering the feedback.

What we want is to maximize fast feedback. Unfortunately, an entire end-to-end pipeline usually runs for a “longer-than-average-human-attention-span” duration. This makes it hard for the data dev to use their time productively, and that lost focus translates directly into lost productivity for the business.

In addition, you can bet there will be redundant resource usage to rerun the same parts of the pipeline that we already know work fine. Burning shared resources to check that the new code the data dev added to the end of the pipeline didn’t break anything seems excessive.

Complications:

Breaking focus is a productivity killer. The data dev spends a bunch of time loading up all the computation graphs in their head. It takes time to build a mental model of the data pipeline in their short-term memory. But then, once they launch the pipeline, they have to wait, just sitting there watching a spinner or uninformative logs while the whole job runs. Naturally, after waiting for 3 seconds, their mind starts to wander. What’s for lunch? What are the following meetings for the rest of the day? And they ponder their rich and complex life outside work. So our data devs depart from the mental model of the data pipeline to check something. “It will be quick,” they say to themselves. When they come back 30 minutes later, they face crazy stack traces, full of cryptic error messages about a syntax error in a templated SQL query they didn’t even touch. Oof! And that error fired 10 seconds into the pipeline. That’s 29 minutes and 50 seconds wasted.
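One cheap way to claw back those 29 minutes is to surface that kind of error before launching the pipeline at all. Here is a sketch that renders a Jinja-templated query with representative parameters and parses it locally; the file path and parameters are made up, and sqlglot is just one of several SQL parsers that could play this role.

```python
from jinja2 import Template
import sqlglot

# Render the templated query and parse it locally, so a syntax error surfaces
# in seconds instead of minutes into a warehouse run.
sql_template = open("queries/daily_revenue.sql.j2").read()  # path is hypothetical
rendered = Template(sql_template).render(ds="2023-01-01", table="analytics.daily_revenue")

try:
    sqlglot.parse_one(rendered)
except sqlglot.errors.ParseError as err:
    raise SystemExit(f"Templated SQL does not parse: {err}")
```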

Even worse, a bad data dev lifestyle leads to working at unhealthy times of the day. Early in their new job, data devs realize that resources are constrained. That means data warehouses are busy during the day and under-utilized during the night. So they start working after business hours, at nighttime, to get shorter runtimes and faster feedback. After a few weeks, they realize how unsustainable this is because burnout is real. There must be an easier way.

Guidance:

I remember seeing data similar to the Bad Data Devs Lifestyle diagram in [2] and realizing that many data devs share the same painful experience. The hope is that new data devs come to understand that running a costly end-to-end pipeline just to check that a one-line change works is not reasonable.

Instead, they should look for ways to shorten the feedback loop between adding a single line in a function and checking if it works. This is what we will cover in the rest of this work. Initially, the breadth of techniques available can be confusing and overwhelming. Still, we will follow a structured method to organize the modern testing techniques for data pipelines. The problems are not new, and your colleagues around the world have developed many techniques over recent years. Wisdom of the crowd for the win.
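As a first taste of that shorter loop, here is a sketch of a plain pytest-style unit test for a single transformation function, run locally against a handful of rows instead of the whole warehouse. The function and column names are invented for illustration.

```python
import pandas as pd

def add_order_value(orders: pd.DataFrame) -> pd.DataFrame:
    # The one-line change we actually care about.
    out = orders.copy()
    out["order_value"] = out["quantity"] * out["unit_price"]
    return out

def test_add_order_value():
    orders = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
    result = add_order_value(orders)
    assert result["order_value"].tolist() == [10.0, 4.5]
```

Running this with pytest takes well under a second, which is the kind of loop that keeps the mental model of the pipeline intact.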

TDD + CICD to the rescue?

Diagram: Simplified TDD+CICD Loop

Problem:

New data devs usually fall into two camps. The first camp has never heard of Test-Driven Development (TDD). The second camp tried it and concluded that TDD would get them fired, or at least make them late on every project. The competition is fierce. In fact, the other data devs who use ad hoc spot checks are doing just fine. They get rewarded for the quick turnaround on their projects. As a result, applying TDD feels like fighting an uphill battle. They conclude that using TDD in an organization that only partially embraces and rewards software quality is a fool’s errand.

Software attracts complexity, and data pipelines are not immune to that. Nevertheless, the original ideas of TDD by Kent Beck [6] have transformed how software is built worldwide. After a decade of working with data, I still believe that TDD is the best design technique for software that humans have come up with so far. However, it requires such a wide array of skills and techniques to be applied successfully that many data devs give up on using it in their projects.

Complications:

Then comes Continuous Delivery [37]. New devs already need the mental tools to write and maintain good tests, and on top of that, Continuous Delivery is imposed on them as the best thing since sliced bread. Yes, it is a fantastic method for developing and delivering software. However, its core value proposition comes from being a repeatable disqualification mechanism. CD tries to automate discarding any software bits that do not pass a set of requirements. But how do we encode requirements in software? Yes, you guessed it, we encode the requirements in tests. Without the skills and discipline to write good tests, devs still get some benefits of CD, where they automate the deployment of their code. But they miss the core value of Continuous Delivery altogether.

As the Continuous Delivery definition puts it: “Continuous Delivery is the ability to get changes of all types — including new features, configuration changes, bug fixes, and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.” A new data dev might get the “quickly” bit working. But without automated testing, preferably through TDD, they won’t get the “safely” and “sustainably” bits.

Guidance:

The first step towards improving your software is to be aware of the TDD + CICD loop. In the diagram Simplified TDD+CICD Loop, we start by writing the most basic acceptance test of what the data pipeline is supposed to produce. Then we decompose the pipeline into essential parts. We then dive into the individual components of the pipeline and write tests for them before writing the code. Not writing the tests before the code will make writing the tests feel like a chore. Then we make the tests pass and repeat the process until the acceptance test passes.
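Here is a sketch of what that first, most basic acceptance test can look like in pytest form. The pipeline function, its columns, and the expected output are placeholders; the point is only that the test exists before the code does.

```python
import pandas as pd

def test_pipeline_produces_daily_totals():
    # Acceptance test written first: given a tiny, known input,
    # the pipeline must produce one total row per day.
    raw_events = pd.DataFrame(
        {"day": ["2023-01-01", "2023-01-01", "2023-01-02"], "amount": [10.0, 5.0, 7.0]}
    )

    result = run_pipeline(raw_events)

    expected = pd.DataFrame({"day": ["2023-01-01", "2023-01-02"], "total": [15.0, 7.0]})
    pd.testing.assert_frame_equal(result, expected)

# The simplest implementation that turns the test from red to green;
# later iterations refine it without breaking the acceptance test.
def run_pipeline(raw_events: pd.DataFrame) -> pd.DataFrame:
    return (
        raw_events.groupby("day", as_index=False)["amount"]
        .sum()
        .rename(columns={"amount": "total"})
    )
```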

Initially, we want a “Walking Data Skeleton” or a “Primitive Data Whole”: something valuable enough to be deployable. Then we set up the CICD pipeline to be able to deploy the code we have so far. The deployment of a data pipeline usually requires a scheduling mechanism, so we put that in place and include it in the CD pipeline. Once we have a scheduled job that generates the most basic of outputs and a basic deployment process, we can move to the next level of complexity. We write the next acceptance tests and go through the TDD+CICD loop again. Small iterative steps are key here. We will cover this loop in detail in the rest of this book.

Want to learn more about modern data pipelines testing techniques?

Check out my latest book on the subject. This book gives a visual guide to the most popular techniques for testing modern data pipelines.

Book link: Modern Data Pipelines Testing Techniques on Leanpub: https://leanpub.com/moderndatapipelinestestingtechniques/

See ya!

Part 1/3
Part 2/3
Part 3/3

Disclaimer: The views expressed on this post are mine and do not necessarily reflect the views of my current or past employers.


Moussa Taifi PhD

Senior Data Science Platform Engineer — CS PhD— Cloudamize-Appnexus-Xandr-AT&T-Microsoft — Books: www.moussataifi.com/books