Data Pipeline Foundations - Everything You Need To Know About Data Pipelines
Hi, fellow future and current Data Leaders; Ben here 👋
Over the past few years, I’ve written plenty of articles on data pipelines. So I decided to put together those articles and create a central place where you can come and read them.
This also means I’ll continue to update it in the future!
So come back in a month or two, and I am sure you’ll see even more.
Now, before diving in today, I do want to give a special thanks to Estuary. They’ve supported the Seattle Data Guy newsletter for a long time, and it means a lot to me. They’ve also helped multiple clients of mine not have to think about their data ingestion and workflows. It really has been awesome to work with them both in terms of building data pipelines and as an advisor. So thank you!
If you want to simplify your data workflows, check them out today.
Now let’s jump into the article!
Much of what we humans do is merely logistics and pipes.
Getting something we need from one place to another.
We have supply chains that transfer that coffee maker from the producer to your front door.
We have electrical grids that move electricity from the plant to your outlet.
We have water systems that take water from its source, treat it, and move it into homes and businesses.
We have transportation networks that move food from farms to grocery store shelves.
And data is no different.
At its core, a data pipeline is just another system of pipes.
That’s why over the past year or so, I’ve been putting together a series on data pipeline basics: something anyone should know, whether their title is data analyst, AI engineer, or data engineer.
Below are those articles, as well as several others that I think will help you understand most of what you need to know when developing data pipelines.
Where Data Comes From - Data Sources
Data pipelines always have a source and a destination. That source might be an S3 bucket with thousands of PDFs, CSVs passed via SFTP, an API, or just a connection to a database.
I know we’d like to think this is a smooth process between all companies.
But some companies are still passing data via email and SFTP.
And at least in the case of SFTP, it works.
So in this first section, there are several articles and a video all focused on data sources!
Data Sharing in the Real World: Why SFTP Remains Essential for Companies
From Basics to Challenges: A Data Engineer’s Journey with APIs
The Ultimate Database Guide: Choosing Relational, Document, or Time Series to Drive Success by Yordan Ivanov
How Data Gets Processed
Ok, the first step in most data pipelines is the extract.
After that, you get to actually process the data.
But how?
That’s where the “T” part of the pipeline often comes in. But those transformations don’t all look the same. You have options: some tooling is drag-and-drop, while other tooling is SQL-based.
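To make the SQL-based option concrete, here is a minimal sketch of an extract, load, and transform flow using only Python’s standard library. The CSV contents, the table names (raw_orders, customer_totals), and the run_pipeline function are all hypothetical and exist only for illustration; a real pipeline would read from an actual source and write to a real warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; in practice this might arrive via SFTP or an API.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,5.00
3,alice,4.50
"""


def run_pipeline(raw_csv: str) -> list:
    """Load raw rows into SQLite, then transform them with plain SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real destination

    # Extract + load: parse the CSV and land it in a raw table as-is.
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(raw_csv))
    ]
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: the "T" step, expressed as SQL against the raw table.
    conn.execute(
        """
        CREATE TABLE customer_totals AS
        SELECT customer, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY customer
        """
    )

    result = conn.execute(
        "SELECT customer, total_amount FROM customer_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_pipeline(RAW_CSV))
```

A drag-and-drop tool would express the same grouping step visually; the logic is the same, only the interface changes.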
Let’s start with understanding the basics.
By that I mean we’ll start with the “Why”: why do data pipelines exist? We’ll go from there.
Building Real Pipelines
Now that we’ve got the basics down, let’s talk about what it actually takes to build these pipelines in the real world.
Below are the articles I’ve written that cover basic patterns, as well as what it would take if you wanted to build your own orchestration tooling.
Operating Pipelines in Production
Once you get past all the fluff of building, well, that’s where reality kicks in.
Every data pipeline you build is a liability.
And sure, now we have AI, which just means we will write more data pipelines faster, equating to more work in the future.
When you need to migrate, there is more to migrate.
There are more data pipelines that can fail and require a backfill.
There are more data quality issues that can arise. You can read more about it in the articles below.
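One common way to soften the backfill and data quality pain is to make each load idempotent and to validate rows before writing them. The sketch below is only one possible approach, not a prescribed method: the daily_sales table, the load_partition function, and the "amount must be present and non-negative" rule are all assumptions made up for this example.

```python
import sqlite3


def load_partition(conn: sqlite3.Connection, day: str, rows: list) -> None:
    """Replace one day's data so re-running a backfill never duplicates rows.

    `rows` is a list of (customer, amount) tuples for the given day.
    """
    # Hypothetical quality rule: amounts must be present and non-negative.
    bad = [r for r in rows if r[1] is None or r[1] < 0]
    if bad:
        raise ValueError(f"{len(bad)} bad row(s) for {day}; nothing was loaded")

    # `with conn` commits on success and rolls back if anything raises,
    # so the delete and the insert succeed or fail together.
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, customer, amount) VALUES (?, ?, ?)",
            [(day, customer, amount) for customer, amount in rows],
        )


def make_destination() -> sqlite3.Connection:
    """Create an in-memory stand-in for a real destination table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_sales (day TEXT, customer TEXT, amount REAL)")
    return conn


if __name__ == "__main__":
    conn = make_destination()
    load_partition(conn, "2024-01-01", [("alice", 10.0)])
    load_partition(conn, "2024-01-01", [("alice", 10.0)])  # safe to re-run
    print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0])
```

Because each run deletes and rewrites a single day, a failed pipeline can be re-run for that day without manual cleanup.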
Beyond the Basics
Once you’ve gotten the basics down and you understand what it takes to go from source to destination, you’ve just started your journey.
Because, of course, you also need to consider:
How are you actually going to model said data?
How will you ensure you can join across multiple sources and not just recreate data silos?
What about migrations?
And I have barely even looked at tooling
So once you’ve finished the articles here, congrats, you’ve barely begun!
Why Your Data Infrastructure Migration Project Will Fail (And How to Succeed)
Data Modeling Where Theory Meets Reality - How Different Companies I Worked At Modeled Their Data
Final Thoughts
Even with all the focus on AI, what I continue to see is that companies need to get data into a central location, and then either use traditional methods like SQL and BI to ask questions of their data or try to layer AI on top of it.
The funny thing that I’ve also noticed is that there seems to be a similar approach to data modeling in the AI era as in the self-service era.
Which is, just keep building new tables for specific use cases, or just-in-time data modeling (a term coined by Joe Reis).
All of which is fed by, you guessed it, data pipelines.
Data pipelines that still need to bring data together, standardize it, integrate it, and make it ready for whoever will be querying it.
If you think there are other topics that need to be covered about data pipelines, let me know so I can add even more articles to this one!
Articles Worth Reading
There are thousands of new articles posted daily all over the web! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Under the Hood: Scaling Responsible AI at Uber
As AI and machine learning become even more central to critical products and services—including at Uber—companies must understand how their models work and ensure they’re governed responsibly. At Uber, where AI is developed across many platforms and teams, we launched a company-wide Responsible AI program to bring visibility, explainability, and governance to models. Through a centralized Model Catalog, automated tooling, feature-importance explainability, early compliance checks, and broad employee education, we’ve built a scalable foundation for responsible AI innovation across the company.
5 Key Predictions for the Data Industry in 2026
One third of the year is over, at least by months, and somehow it feels like a year’s worth of events have occurred.
By the end of 2025, dozens of companies were swallowed up. Everyone wanted to buy everyone, and here we are in 2026, seeing more of that as well as some pretty slick new AI model releases.
But let’s turn towards the future, and specifically, data.
Here is what I believe we’ll see happen in the next year or two.
End Of Day 216
Thanks for checking out our community. We put out 4-5 Newsletters a month discussing data, tech, and start-ups.
If you enjoyed it, consider liking, sharing and helping this newsletter grow.



