Backfills - The Necessary Evil of Data Engineering
Why backfills happen, why we hate them, and how to handle them without breaking trust
Hi, fellow future and current Data Leaders; Ben here 👋
One thing most of us data engineers dislike is backfills. Why is that? And what does backfilling require?
Before we jump into backfills, I wanted to share a bit about Estuary, a platform I’ve used to help make clients’ data workflows easier and that I’m an adviser for. Estuary helps teams easily move data in real time or on a schedule, from databases and SaaS apps to data lakes and warehouses, empowering data leaders to focus on strategy and impact rather than getting bogged down by infrastructure challenges. If you want to simplify your data workflows, check them out today.
Now let’s jump into the article!
At some point, if you work in data, whether you’re an analyst or a data engineer, you’re going to have to do it.
You’re going to have to backfill a table.
Actually, it’ll probably be pretty early in your career. Backfilling or rerunning a pipeline is just a necessity, AI or not.
There are plenty of reasons why you might need to backfill a table, sadly.
Talk to data engineers, and many of them will tell you they dislike the process of backfilling.
So let’s start there. Let’s discuss why we backfill as data teams and why we dislike it so much.
Why Do We Need to Backfill Data?
Backfills exist for many reasons. For example, systems are not static, and people make mistakes. In turn, tables and data sets need to be rerun.
Some of the most common reasons you’ll need to backfill include:
Late or corrected source data - Upstream systems change historical records all the time. Maybe the data was being recorded incorrectly, or, if you’re getting files over SFTP, the sender might have been delivering bad files that no one caught. Now you’re going to have to backfill at least a specific date or customer cut of that data.
Bugs in your data pipelines - A common reason why tables need to be backfilled is that there was a bug. Something was wrong, and maybe just running an update statement isn’t sufficient. So now you’ve got to rebuild the table with the new logic without disrupting end-users.
Schema or logic changes - This was a common reason I needed to backfill at Facebook. When columns were removed at Facebook or, in some cases, certain data type conversions were required, we’d have to rebuild the table. Then add in new logic changes or sources for data, and you’ll likely need to reload the entire table.
These are, of course, only a few reasons!
Why Data Engineers Dislike Backfilling
If you ask a data engineer what they don’t like doing, I am sure backfilling will be one of the few things they reference, besides migrations.
Here are a few reasons why.
Scale - In some cases, backfills mean waiting hours, if not a day or two, to rerun a table. At Facebook, sometimes backfilling a table meant needing to rerun thousands of jobs for each partition and each of the upstream tasks. That means there are a lot of ways things can go wrong.
Cost - You’d better be sure your backfill updates are right. Having to rerun a backfill job on pay-as-you-go technology will be expensive, especially if you have to load the data from raw.
Time Consuming - There are multiple ways backfills take time. They can bump into daily jobs, especially if you are on-prem. They also take time out of the day of an engineer who has to ensure all the data is accurate and runs as expected. It’s just one giant time suck that keeps the data team from delivering new work.
Blast Radius - So you’ve built a table that everyone at your company uses and relies on. Great, now it’s going to take even longer to backfill. I had multiple cases where I’d be backfilling a table that had dozens, if not hundreds, of end-users. You’re going to need to update them and make sure they know what’s happening and if they need to do anything. If they have to modify their pipelines, then that further drags out your process.
Trust - If stakeholders see numbers change unexpectedly, they are going to question it. So one of the many reasons data teams dislike backfills, especially if they have to do it frequently, is that it can erode trust in the overall dataset.
Backfilling “Controversy”
As I was going through to see what other people had said about backfilling over the past few years, I ran into a discussion on a post between Zach Wilson and Brian Greene from over a year ago (not trying to restart a fight here, just making sure people don’t feel like I am talking about them behind their backs).
The argument itself became a little heated, but stripping away that component I do think there is something worth talking about.
I think both Brian and Zach have different experiences in different systems that, in turn, have different requirements for backfilling.
A goal you should have when backfilling (beyond the obvious one of actually backfilling) is not to run a bunch of random SQL scripts against production.
Instead, you should create an approach that lets you maintain a repeatable process that balances making changes safely with giving space to ensure the new data is correct.
In some cases, you’re working at a company where your tables aren’t hit that frequently, and you can likely safely swap them with a simpler process. You also might not be required to have a strict audit structure.
In other cases, you’re working at an organization where you need to audit every single change because it’s life or death. That requires a far higher level of scrutiny and patience when making even the smallest change.
Zach was referencing more of a blue-green style deployment (I think), where you swap one table for another, whereas Brian, at least from what I could tell, was suggesting that your pipeline should just be re-runnable.
I do think the table-swap approach is more common in modern tech stacks that use write-once, read-many approaches and storage formats like ORC, Avro, and Parquet.
This is because depending on the underlying formats you might run into various limitations on data type conversions and how to remove columns.
I think another person who captured the blue-green-like deployment is Albert Campillio (in the image below). The one thing I’d call out in the diagram below is the gap between steps 3 and 4. This is likely something someone would consider hacky, as it’s not necessarily part of an automated process, and someone might read it and just run two ALTER statements in a row, leaving a moment in time when the actual production table doesn’t exist.
Or worse, maybe you lose internet connection in between both statements, and you were running them one at a time. Now there’s a massive gap. If this were in some highly regulated environment with people’s lives on the line, you’ve now posed a lot of new risk.
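To make that risk concrete, here is a minimal sketch of wrapping both renames in a single transaction so a failure in between leaves production untouched. It uses SQLite (via Python's built-in sqlite3 module) purely as a stand-in because its DDL is transactional; whether your warehouse behaves the same way is something you'd need to check. The table names and the swap_tables helper are hypothetical, not anything from the discussion above.

```python
import sqlite3


def swap_tables(conn, prod, staging):
    """Promote `staging` to `prod`, running both renames in one transaction."""
    backup = f"{prod}_old"
    conn.execute("BEGIN")
    try:
        conn.execute(f'ALTER TABLE "{prod}" RENAME TO "{backup}"')
        conn.execute(f'ALTER TABLE "{staging}" RENAME TO "{prod}"')
        conn.execute("COMMIT")
    except Exception:
        # Undo the first rename if the second one (or anything else) failed.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


# isolation_level=None lets us issue BEGIN/COMMIT/ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE daily_metrics (day TEXT, value INTEGER)")
conn.execute("INSERT INTO daily_metrics VALUES ('2024-01-01', 1)")
conn.execute("CREATE TABLE daily_metrics_backfill (day TEXT, value INTEGER)")
conn.execute("INSERT INTO daily_metrics_backfill VALUES ('2024-01-01', 2)")

# Simulate the "something broke between the two statements" case:
# the second rename fails, so the first one is rolled back.
try:
    swap_tables(conn, "daily_metrics", "table_that_does_not_exist")
except sqlite3.OperationalError:
    pass
value_after_failure = conn.execute("SELECT value FROM daily_metrics").fetchall()

# The real swap: readers see either the old table or the new one, never neither.
swap_tables(conn, "daily_metrics", "daily_metrics_backfill")
value_after_swap = conn.execute("SELECT value FROM daily_metrics").fetchall()
```

The point of the sketch is less the specific syntax and more the shape: the swap is one scripted, repeatable unit rather than two statements someone types by hand.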
All this said, why you need to backfill and what your underlying infrastructure looks like will change how you approach the backfill.
For example, here are a few specific cases.
You’ve received bad data for a specific customer or external partner
You’ve received bad data from a specific source
You need to update the logic that is incorrect or needs to be changed
You need to remove columns, and your table doesn’t allow for that
You need to alter a column data type, but you can’t do that
Another point worth considering is whether or not you have a traditional database table. Think just creating a basic table in Snowflake or SQL Server vs. one with data partitions, as this changes not only data modeling but also backfilling.
What To Consider When Backfilling
As I was writing, this article started to get rather long. So it looks like this may have to be two articles. So in this one, let’s talk about likely types of backfill approaches.
Traditional Table
Let’s say, for example, you have a traditional table, and your pipelines are set up to pull data from CSVs on an SFTP server, and your external data partner realizes they’ve been sending you the wrong data for one reason or another.
Then you should have a pipeline that lets you reference:
External Partner ID
Dates you want to run on
In this case, you can just rerun your pipeline as is if you’ve built it correctly. It will remove the old data if there is replacement data, and you only need to replace a single instance of the data, meaning that once the data is updated, everyone has access to accurate data.
The other point I’d bring up here is how long the pipeline would take to run, as well as whether it’d block other key tasks, as that might also change your approach.
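As a rough illustration of what "re-runnable" can look like for a traditional table, here is a minimal sketch of a delete-then-insert load parameterized by partner ID and date range. It uses SQLite as a stand-in database, and the partner_sales table, its columns, and the rerun_partner_load function are all made up for the example. It also assumes the corrected rows passed in only cover the partner and dates being replaced.

```python
import sqlite3


def rerun_partner_load(conn, partner_id, start_date, end_date, rows):
    """Replace one partner's rows for a date range in a single transaction.

    `rows` holds (partner_id, event_date, amount) tuples re-extracted from
    the corrected source files.
    """
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            "DELETE FROM partner_sales "
            "WHERE partner_id = ? AND event_date BETWEEN ? AND ?",
            (partner_id, start_date, end_date),
        )
        conn.executemany(
            "INSERT INTO partner_sales (partner_id, event_date, amount) "
            "VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE partner_sales (partner_id TEXT, event_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO partner_sales VALUES (?, ?, ?)",
    [
        ("acme", "2024-01-01", 10.0),   # the bad value the partner sent
        ("acme", "2024-01-02", 20.0),   # outside the rerun window
        ("globex", "2024-01-01", 5.0),  # a different partner, must stay as is
    ],
)
conn.commit()

corrected = [("acme", "2024-01-01", 12.5)]
rerun_partner_load(conn, "acme", "2024-01-01", "2024-01-01", corrected)
# Running it a second time leaves the table in the same state.
rerun_partner_load(conn, "acme", "2024-01-01", "2024-01-01", corrected)

final_rows = conn.execute(
    "SELECT partner_id, event_date, amount FROM partner_sales "
    "ORDER BY partner_id, event_date"
).fetchall()
```

Because the delete and the insert share one transaction, readers never see the window where the old rows are gone but the new ones haven't landed, and rerunning the same parameters doesn't create duplicates.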
Partition-Based Tables
Now, let’s say you have a table with 180 partitions, all of which have bad data in them. If you run a backfill against all the partitions in place, there is a good chance that you’ll have inconsistent data while it runs, especially if some of the steps take a few minutes (or longer) each.
I recall some of the backfills I ran at Facebook would be 1000+ tasks. If you need to backfill 100 partitions, it could get pretty chaotic. Of course, long before I left, we started setting 90-day retention periods, so at least you had fewer partitions to backfill.
In these cases, swapping tables, although “clunky,” would result in a safer transfer.
You:
Reduce the chance of downstream tables ingesting inconsistent data
Don’t risk having teams reading from different partitions and, in turn, getting inconsistent metrics
Etc
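Here is one way the swap approach for a partitioned table could be sketched: rebuild every partition in a staging table, refuse to promote it if any partition is missing, and only then swap. SQLite stands in for the warehouse, a ds column stands in for real partitions, and the events table, the backfill_via_staging function, and the rebuild_partition callback are hypothetical names for illustration.

```python
import sqlite3


def backfill_via_staging(conn, partitions, rebuild_partition):
    """Rebuild all partitions in a staging table, then promote it in one step.

    `rebuild_partition(ds)` returns the corrected (ds, metric) rows for one
    partition.
    """
    conn.execute("DROP TABLE IF EXISTS events_staging")
    conn.execute("CREATE TABLE events_staging (ds TEXT, metric INTEGER)")
    for ds in partitions:
        conn.executemany(
            "INSERT INTO events_staging VALUES (?, ?)", rebuild_partition(ds)
        )

    # Gate the promotion: every expected partition must have landed.
    loaded = {r[0] for r in conn.execute("SELECT DISTINCT ds FROM events_staging")}
    missing = sorted(set(partitions) - loaded)
    if missing:
        raise RuntimeError(f"staging is missing partitions: {missing}")

    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE IF EXISTS events_old")
        conn.execute("ALTER TABLE events RENAME TO events_old")
        conn.execute("ALTER TABLE events_staging RENAME TO events")
        conn.execute("COMMIT")
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE events (ds TEXT, metric INTEGER)")
days = ["2024-01-01", "2024-01-02", "2024-01-03"]
conn.executemany("INSERT INTO events VALUES (?, -1)", [(d,) for d in days])


def flaky_rebuild(ds):
    # Pretend one partition's upstream task produced nothing.
    return [] if ds == "2024-01-02" else [(ds, 100)]


try:
    backfill_via_staging(conn, days, flaky_rebuild)
except RuntimeError:
    pass  # production was never touched
after_failed_run = conn.execute("SELECT ds, metric FROM events ORDER BY ds").fetchall()

backfill_via_staging(conn, days, lambda ds: [(ds, 100)])
after_good_run = conn.execute("SELECT ds, metric FROM events ORDER BY ds").fetchall()
```

While the staging table is being filled, downstream readers keep seeing the old (consistently wrong) data rather than a mix of fixed and unfixed partitions, which is the trade-off the swap is buying you.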
There is also another reason why you might want to swap tables vs. just replace data via a re-runnable pipeline.
Altering Table Schemas
This one is less of an approach or table type, but more of a “why” you might have to swap tables.
There were a few cases where replacing the prior table was necessary. Often this was because a data type needed to be changed, and that couldn’t be done in place, either because the underlying file format didn’t allow it or because the primitive type conversion would fail.
Removing columns also posed issues. For the tables at Facebook in particular, dropping a column wouldn’t really delete the data. It would just essentially hide the column. So in many cases, we’d have to recreate the table to truly get a clean table.
I do think this might be one of the reasons for the practices taken on at companies like Facebook. I’ve worked in many different data stacks, and I find that every team finds a way to safely backfill and rerun pipelines.
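For the schema-change case, a rebuild could look something like the sketch below: copy the data into a new table with the target column type and without the dropped column, and collect the rows whose values can't be converted so you find out before anything is swapped. This is a toy version in SQLite with the conversion done in Python; the orders table, its columns, and rebuild_with_new_schema are invented for the example and are not how it was done at Facebook.

```python
import sqlite3


def rebuild_with_new_schema(conn):
    """Copy `orders` into `orders_v2` with amount_cents as INTEGER and
    without `legacy_flag`. Returns the ids whose amount failed to convert."""
    conn.execute("DROP TABLE IF EXISTS orders_v2")
    conn.execute(
        "CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, amount_cents INTEGER)"
    )
    converted, bad_ids = [], []
    rows = conn.execute("SELECT id, amount_cents FROM orders ORDER BY id").fetchall()
    for order_id, amount in rows:
        try:
            converted.append((order_id, int(amount)))
        except (TypeError, ValueError):
            # NULLs and non-numeric strings end up here instead of being
            # silently coerced.
            bad_ids.append(order_id)
    with conn:
        conn.executemany("INSERT INTO orders_v2 VALUES (?, ?)", converted)
    return bad_ids


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents TEXT, legacy_flag TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "1250", "y"), (2, "n/a", "n"), (3, "300", "y")],
)
conn.commit()

bad_ids = rebuild_with_new_schema(conn)
new_rows = conn.execute("SELECT id, amount_cents FROM orders_v2 ORDER BY id").fetchall()
new_columns = [r[1] for r in conn.execute("PRAGMA table_info(orders_v2)")]
```

In practice you’d decide what to do with the ids in bad_ids (fix upstream, default them, or block the change) before promoting the new table.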
Final Thoughts
There are sadly many reasons you might have to backfill a table. You can limit the number of backfills you have to do by:
Having reliable and useful data quality checks
Designing pipelines to be easy to rerun and parameterized
Avoiding one-off fixes in production
Understanding the limitations of your storage format
But it will still happen. So build your tables and pipelines with backfills and reruns in mind. That way, maybe you’ll hate them a little bit less.
As always, thanks for reading!
Articles Worth Reading
There are thousands of new articles posted daily all over the web! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
The Analytical Skills No One Teaches You
When I asked Olga Berezovsky to write an article, I wanted it to be focused on skills that data professionals don’t get taught explicitly.
There aren’t a lot of videos out there on how to deliver an impactful analysis to executives.
Even when it comes to running an analysis, many of us likely had to feel around in the dark a bit. I was just speaking to another data science leader who said they had to have an executive essentially take them aside and let them know their analyses weren’t great.
There are so many of these skills that analysts and engineers alike have to pick up on the job and no one tends to tell you what is good or bad.
So let’s talk about some of those skills you need to start working on!
The Insanity of Data Education
Yesterday, I published a piece on the organizational crisis in data modeling, based on a survey of over 1,100 data professionals. The overarching theme? A whopping 89% of respondents are struggling with their data modeling approach. Only 11% say things are actually going well.
When you dig into the complaints, the numbers paint a bleak picture:
59% cite the relentless pressure to move fast.
51% suffer from a complete lack of clear ownership.
The TL;DR here? It’s your boss’s fault. I’m half-joking, but not really. More like your boss’s boss’s boss’s fault.
End Of Day 211
Thanks for checking out our community. We put out 4-5 Newsletters a month discussing data, tech, and start-ups.
If you enjoyed it, consider liking, sharing and helping this newsletter grow.