Daily Tasks With Data Pipelines - Data Quality Checks And The Problem With Noisy Checks
Every morning, your team wakes up to 137 data quality alerts.
132 are ignored.
3 are acknowledged, maybe you even reach out to the producers of the errors, but they are ignored…
2 get fixed.
Ok, that might be a bit of an exaggeration.
But it’s so easy to build data pipelines and add data quality checks these days that I am sure for some people this is reality.
After all, all you need to do is…
Write a few prompts, drag a few boxes and arrows, and check a few boxes, right?
Now you’ve got another 215 data quality checks to wake up to.
For the past few months, I’ve been writing a series on data pipelines, and I think an important aspect is understanding the maintenance tasks you will have to take on by building your data pipeline.
One of those common tasks involves data quality checks, and I don’t just mean writing them.
I mean dealing with noisy notifications.
The Promise of Data Quality
Data quality solutions are built on the promise that they will help build data sets you can trust.
They’ll catch bad data before it enters your systems.
And trust me, I’ve spoken with plenty of data teams where the business has lost all trust in their data warehouses and lake houses.
That’s when many data leaders turn to either finding solutions or building their own data quality tools.
They want to make sure the business trusts the data coming out of their systems.
Which is a reasonable goal.
I mean, I wrote an article about it!
How And Why We Need To Implement Data Quality Now!
As companies look to incorporate AI and ML into their data strategies and roadmaps, there is a new opportunity to refocus on data quality.
But eventually, this can also cause other issues.
Where It Breaks - Noisy Checks
When I started in the data world, I had to develop my own system for managing data quality.
It was essentially a Python system that used SQL templates where the user could provide a few parameters to test several different generic types of errors.
Want to allow for a certain percentage of nulls on a specific column? Well, just fill in the following parameters:
column_name: patient_middle_name
percentage_warning: 95
percentage_failure: 80
Now you’ve got similar checks with tools like dbt.
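To make the parameterized check above concrete, here is a minimal sketch of a SQL-template check driven by those three parameters. It is not the original system: the function name, the template, and the `patients` table are hypothetical. It also assumes the two percentages are minimum shares of populated (non-null) rows, so the check warns below 95% and fails below 80%. It uses an in-memory SQLite database so it runs anywhere.

```python
import sqlite3

# Hypothetical template: share of rows where the column is populated (non-null).
# COUNT(column) skips nulls, COUNT(*) counts every row.
POPULATED_PCT_SQL = """
SELECT 100.0 * COUNT({column_name}) / COUNT(*) AS populated_pct
FROM {table_name}
"""


def run_populated_check(conn, table_name, column_name,
                        percentage_warning, percentage_failure):
    """Return 'pass', 'warning', or 'failure' for one column.

    Assumption: both thresholds are minimum populated percentages.
    """
    # Identifiers come from trusted config here; never format user input into SQL.
    sql = POPULATED_PCT_SQL.format(table_name=table_name, column_name=column_name)
    (populated_pct,) = conn.execute(sql).fetchone()
    if populated_pct is None:
        # SQLite returns NULL for division by zero, which means the table is empty.
        return "failure"
    if populated_pct < percentage_failure:
        return "failure"
    if populated_pct < percentage_warning:
        return "warning"
    return "pass"


# Illustrative data: 9 of 10 rows have a middle name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER, patient_middle_name TEXT)")
rows = [(i, "Lee" if i < 9 else None) for i in range(10)]
conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)

status = run_populated_check(
    conn,
    table_name="patients",
    column_name="patient_middle_name",
    percentage_warning=95,
    percentage_failure=80,
)
print(status)
```

With 90% of rows populated, this lands between the two thresholds, so it reports a warning rather than a failure. That split is what lets you decide later which results deserve a notification at all.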
Great, so you’ve got your system for data quality!
Problem solved, right?
But there is a flip side to data quality checks. They really are great, but it’s easy to go from “let’s build a few checks” to every column having dozens of checks, and every morning you have dozens of notifications that you eventually start to ignore.
That might be because there are bigger issues, you’re not rewarded for fixing data quality problems, or the teams that produce the bad data aren’t interested in fixing it.
So let’s break down the key areas and reasons checks get noisy.
Over-checking Everything
Every column gets rules - It’s tempting to make sure every column gets checks. But maybe you don’t use every column, or maybe some are just frequently null, and until you plan to fix them at the source, you’re just creating noise.
Nobody prioritizes what actually matters - Some tables can have hundreds of columns. Now, you probably shouldn’t bring them all in unless you really need them, but if you do, and then add several data quality checks per column, it’s going to get chaotic. That’s why you need to be specific about which columns really matter, as well as what errors will really break things.
Poorly Tuned Thresholds
This is less of a problem with many modern data quality solutions, as they often use some model to set thresholds dynamically. Now, I’d hope that they’d also be able to detect drift over time. That is to say, maybe you have a slight increase or decrease daily, but each change is small enough that it never triggers anything, even as the total shift keeps growing.
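The drift problem is easier to see with numbers. The sketch below is only an illustration, not how any particular tool works: the function names, the thresholds, and the row counts are all made up. It compares a day-over-day check with a check against a fixed baseline (the mean of the first week) on a metric that shrinks 1% per day.

```python
def day_over_day_alerts(values, max_daily_change_pct):
    """Indexes where the value moved more than the threshold versus the previous day."""
    return [
        i
        for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) / values[i - 1] * 100 > max_daily_change_pct
    ]


def drift_alerts(values, baseline_days, max_drift_pct):
    """Indexes where the value is too far from the mean of the first baseline_days."""
    baseline = sum(values[:baseline_days]) / baseline_days
    return [
        i
        for i in range(baseline_days, len(values))
        if abs(values[i] - baseline) / baseline * 100 > max_drift_pct
    ]


# Hypothetical daily row counts that shrink 1% per day for 30 days.
row_counts = [1000 * 0.99 ** day for day in range(30)]

daily = day_over_day_alerts(row_counts, max_daily_change_pct=5)
drift = drift_alerts(row_counts, baseline_days=7, max_drift_pct=10)

print("day-over-day alerts:", daily)
print("first drift alert on day index:", drift[0] if drift else None)
```

The day-over-day check stays silent the whole month because no single step exceeds 5%, while the baseline comparison starts firing once the cumulative decline passes 10%. A fixed baseline is the simplest option; a trailing window from a few weeks back is another reasonable choice.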
Alerts Lacking Clear Ownership
At the end of the day, if you have data quality checks going off, but no one is set to own the pipeline or the fix, well, guess what, it’s not getting fixed. So someone on the team needs to own them. This could be something that your on-call team member takes on.
Misaligned Incentives
The team generating bad data is rewarded for shipping features, not fixing pipelines.
I think most data teams feel this one.
The application team focuses on building an application that functions. That doesn’t always mean the data is tracked in a useful way. I talked about this in a past article, but maybe they don’t use their updated_at field or simply write over information as someone changes where they live. As long as the application functions, they might be ok with this.
So the bad data will continue to flow…
What Actually Works
If your data quality solution is more traditional, then here are a few simple things you can do to reduce the noise.
Focus on fewer checks that are on columns that actually matter
Prioritize business-critical tables
Tier your alerts (critical vs informational) and make it so they only bubble up under certain conditions
Assign clear ownership
Regularly delete useless checks
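As a rough sketch of how tiering and ownership from the list above might fit together, the example below routes failed checks either to an immediate page or to a digest. Everything here is hypothetical: the `CheckResult` fields, the owner names, and the rule that an informational check only bubbles up after three consecutive failures are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    check_name: str
    table: str
    tier: str                  # "critical" or "informational"
    failed: bool
    consecutive_failures: int  # how many runs in a row this check has failed


# Hypothetical ownership map; tables without an owner fall back to on-call.
TABLE_OWNERS = {"orders": "payments-team", "patients": "clinical-data-team"}


def route_alerts(results, owners, fallback_owner="on-call", escalate_after=3):
    """Split failed checks into pages (sent now) and a digest (reviewed later)."""
    pages, digest = [], []
    for result in results:
        if not result.failed:
            continue
        owner = owners.get(result.table, fallback_owner)
        entry = (owner, result.table, result.check_name)
        # Critical checks always page; informational ones only after repeated failures.
        if result.tier == "critical" or result.consecutive_failures >= escalate_after:
            pages.append(entry)
        else:
            digest.append(entry)
    return pages, digest


results = [
    CheckResult("not_null_order_id", "orders", "critical", True, 1),
    CheckResult("null_rate_middle_name", "patients", "informational", True, 1),
    CheckResult("row_count_stable", "sessions", "informational", True, 4),
    CheckResult("unique_patient_id", "patients", "critical", False, 0),
]

pages, digest = route_alerts(results, TABLE_OWNERS)
print("page now:", pages)
print("daily digest:", digest)
```

The point of the design is that every failed check ends up with a named owner, and only a small subset interrupts someone. Checks that sit in the digest for weeks without anyone acting on them are good candidates for deletion.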
Now that’s a start, but that’s far from everything.
I am also going to add a caveat here.
I think with AI, some of the points above might have more nuance.
If you can create a system that detects multiple data quality issues, but only surfaces the critical ones or at least knows how to better consolidate and provide the end-user with easy-to-digest information, then that will circumvent some of these challenges (I am sure some tools out there do so). But I will always tell readers and clients to be wary. Many solutions look good in a demo, but fail to live up to their promise.
Final Thoughts
Data quality is important, and you need to ensure that your data pipelines produce reliable data.
But there is a point where data engineers will start ignoring checks, especially if they put effort into trying to fix the issue, and they are:
Not provided support
Not rewarded for creating systems that help improve quality, or systems that can help reduce the burden of noisy checks
Working in systems where bad data is tolerated downstream anyway
Measured on delivery speed, not data reliability
Expected to own quality without owning the upstream systems
Lacking context on whether the issue even matters to the business
Not involved in defining what “good data” actually means
Pushing a pipeline to production is not the end.
You will have to maintain it.
So don’t expect it to end with the pipeline. You’ll have plenty more work to do once it’s published.
But maybe you’ll have an agent take care of that in the future.
As always, thanks for reading!
Articles Worth Reading
There are thousands of new articles posted daily all over the web! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Composable Chaos: Why Enterprise Architects Matter More Than Ever
As businesses adopt AI, composable architectures and increasingly complex digital experience stacks, the role of the enterprise architect is shifting from technical oversight to strategic design authority. Once focused primarily on governance and system alignment, enterprise architects are now responsible for ensuring that rapidly evolving technologies can scale, integrate and deliver measurable business outcomes.
This shift reflects a broader reality: without a coherent architectural foundation, even the most advanced tools can create fragmentation rather than value. This article explores how the role of the enterprise architect is evolving, why it is becoming central to enterprise strategy, and what it means for businesses navigating increasingly complex technology ecosystems.
Full Refresh vs Incremental Pipelines
The need to extract and centralize data…surprisingly…remains a challenge.
Perhaps in the future, when we can all vibe code our own apps on Snowflake or Databricks Postgres services, and the line between analytical and operational data stores blurs further, we won’t need this.
But for now, even small organizations can have dozens of data pipeline solutions pulling out data.
It’s easy: find a reliable data connector solution, build a few dbt models or some drag-and-drop solutions, and you’re done.
Of course, it’s not that simple. As part of the process, you need to consider how those pipelines will actually load data into your tables.
Which also means you need to understand your data.
Is there an update date? Can it be updated? How is it updated?
From there, you can start understanding how you want to build your data pipeline.
Bringing us to today’s topic, incremental and full table refreshes.
So let’s dive into it!
End Of Day 215
Thanks for checking out our community. We put out 4-5 Newsletters a month discussing data, tech, and start-ups.
If you enjoyed it, consider liking, sharing and helping this newsletter grow.





Great writing as always! Addresses "how do you close the loop between report data and source systems?"
One hack that worked well for me is to track error counts over time and find a few execs to keep an eye on the scorecard. They'd occasionally make some well-placed calls to neglectful rule owners, which motivated everyone to fix data at the source.
Thanks!
ZH
This is the death spiral most data teams don’t name out loud. Stakeholder complains. Team adds a check. Nobody prioritizes which checks matter. Six months later you have 400 alerts firing and the on call just ignores them. Rather than stacking more checks on top of a broken feedback loop, the move is treating data quality like a product with actual SLAs. Not “we check everything” but “we guarantee these 12 things and here’s the trust score to prove it.” It’s the difference between a smoke detector in every room and a fire marshal who knows which buildings actually matter.