Hi, fellow future and current Data Leaders; Ben here 👋
Today I wanted to talk about Iceberg. I’ve been seeing a lot about it recently. Everyone wants to be involved in the open table format business. But my question is, will it actually solve a business problem?
Before we dive into the article, I’m also excited to share that I'm putting together recordings of my 6-week Data Leaders Playbook Accelerator, which I'll be running again in September. This six-week program is designed to help data leaders drive greater impact and play a more strategic role within their company. If you’re interested, you can sign up for the program here.
With that let’s jump into the article!
Intro
Maybe I’m biased. I love new tools. Like many technophiles, I’m drawn to new solutions. Yet, I still feel I need to pause and question when a new tool is posed as the savior of the data world.
Many people are searching for a silver bullet and assume it must be a new piece of technology. And vendors love it—they lean in. They coin new terms, repackage old ones, and flood the space with “best practices” and design patterns built to hook our curiosity.
And let’s be honest: part of what drew many of us to tech in the first place was the appeal of solving complex problems. Nothing scratches that itch like a new tool packed with just enough nuance and edge cases to unravel.
But lately, I’ve been asking myself a different question—especially when it comes to technologies like Apache Iceberg:
Do we even know what problem we’re trying to solve?
What Question Does Iceberg Solve?
When people start rattling off reasons why Iceberg is great, I can’t help but think of something Shachar posted:
What in the world is the question you’re trying to answer?
You’ll often hear folks say Iceberg is vendor-agnostic, or that it supports time travel. Fair enough, time travel can be useful, especially for data management (I’ll share my thoughts on it being vendor-agnostic later). One example I’ve heard is that it helps you avoid creating four versions of the same table while testing. Helpful, sure. But it’s not exactly a killer reason to switch.
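If time travel sounds abstract, here is a toy sketch of the idea in plain Python. To be clear, this is not Iceberg’s API, and the ToyTable class and the orders example are made up for illustration. It only assumes the general snapshot idea: every write commits a new version of the table, and older versions stay readable.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot-based table. Hypothetical, not Iceberg's API."""
    # Each snapshot is an immutable tuple of rows; snapshot 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [()])

    def append(self, *rows):
        """Commit a new snapshot holding the previous rows plus the new ones."""
        self.snapshots.append(self.snapshots[-1] + rows)
        return len(self.snapshots) - 1  # id of the snapshot just committed

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or an older one by id (the 'time travel')."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


orders = ToyTable()
v1 = orders.append({"id": 1, "amount": 10})
v2 = orders.append({"id": 2, "amount": 25})  # say, a test load you may regret

current = orders.read()        # both rows
before_test = orders.read(v1)  # the table as it was before the test load
```

That is the whole trick behind "no more four copies of the same table": you test against the live table and read an earlier snapshot if things go sideways.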
Truthfully, I haven’t yet seen a use case that makes you want to toss everything else out and go all-in on Iceberg. (And I’m sure I’ll get a few comments or DMs after this article).
To be clear, I’m not saying larger companies can’t benefit. If you’ve got an experienced team and a mature platform, you might be able to take advantage of what Iceberg offers. But if you’re a small team, or just not a Big Tech company, I don’t think you’ll get the payoff you expect.
I’ve found myself quoting this article a lot lately, but it still resonates even though it was written over two decades ago:
“Remember that the architecture people are solving problems that they think they can solve, not problems which are useful to solve. Soap + WSDL may be the Hot New Thing, but it doesn’t really let you do anything you couldn’t do before using other technologies — if you had a reason to.” - Joel Spolsky, Don’t Let Architecture Astronauts Scare You
(Judging by that em-dash, who knew Joel had access to ChatGPT so long ago?)
I think many of us cling to new tools because they let us focus on problems we know how to solve, things we can control, rather than focusing on messier, more ambiguous challenges of understanding the business.
Even if you read a dozen articles like the one above, understanding the business in practice is far harder. As
noted, some lessons you just have to live through. You can’t look up stakeholder buy-in on Stack Overflow because each stakeholder might need a different approach, and that approach can change depending on the day. So instead, we retreat to what we can control, even if we’re solving problems the business won’t ultimately value.
The Central Truth
One of the arguments that Iceberg could improve your data strategy is that it’s vendor-agnostic. I actually just saw a post on LinkedIn that said something like, “We need to focus less on tech and more on the business,” and then went on to list a dozen tools, including Iceberg, somehow suggesting that being vendor-agnostic is what will create business value.
The irony.
I do agree that, if used correctly, Iceberg could act as a truer centralized source of truth.
In the past, so many vendors and products pitched themselves as a single source of truth. Of course, they would end up siloing data into yet another database, whether that be Oracle or Teradata.
Even now, many companies I’ve worked with or spoken to have hybrid setups where Databricks acts as the ML workflow solution and Snowflake as the data warehouse layer (despite both companies obviously not wanting that to be the case). The data might exist in various forms in an S3 bucket and in Snowflake, and then you have to port it all around.
In that world, Iceberg won’t save your data strategy. It also won’t drive business impact. However, it will reduce the friction required to deploy data workflows. Instead of porting data around, you could work with it in place using whichever engine you want. That is one of the benefits many tout when it comes to Iceberg, and it does make sense.
That is, until you realize that many of the problems larger organizations hope this will fix are organizational, not technical.
Every VP wants to own their own fiefdom of projects, and everyone wants their own data workflows. So, somehow, most companies would likely end up with three different implementations of Iceberg with six different catalogs. But hey, that problem could be an opportunity.
So We Shouldn’t Use Iceberg?
When I first got into data, Hadoop was at its peak. It felt inescapable—every conference featured Hortonworks and Cloudera.
Data Lakes were the future.
Now, a decade later, Hadoop's search trends have fallen off like a one-hit-wonder. But to say it didn’t help shape today’s data landscape would be shortsighted.
Many great projects, like Presto/Trino and the Hive Metastore, emerged from the Hadoop era. They were built to address the pain points people faced working with the original tech.
I’m not saying you shouldn’t learn Iceberg. But I’d be cautious about deploying it unless it solves a very specific problem you’re actually facing.
And if you’re spending $10K a year on Snowflake? I’d wager a migration probably isn’t worth it. Just a hunch.
I made a similar point in my Columnar storage article, where I mentioned how Uber saved money and storage by switching compression formats. But Uber operates at a massive scale, where those changes actually move the needle.
Do you?
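One way to answer that honestly is a back-of-the-envelope payback calculation. Here is a minimal sketch: the $10K figure comes from above, while the savings rate, the migration costs, and the larger company’s spend are hypothetical numbers I picked for illustration. The payback_years function is also just a made-up helper, so plug in your own estimates.

```python
def payback_years(annual_spend, expected_savings_rate, migration_cost):
    """Years until a migration pays for itself, or None if it never does."""
    annual_savings = annual_spend * expected_savings_rate
    if annual_savings <= 0:
        return None
    return migration_cost / annual_savings


# Hypothetical inputs: a 30% savings rate and a rough guess at the
# engineering time a migration would eat up.
small_team = payback_years(
    annual_spend=10_000,          # the $10K-a-year Snowflake bill
    expected_savings_rate=0.30,   # assumed
    migration_cost=60_000,        # assumed: a few engineer-months
)
large_company = payback_years(
    annual_spend=5_000_000,       # assumed
    expected_savings_rate=0.30,   # assumed
    migration_cost=600_000,       # assumed
)

print(f"Small team: about {small_team:.0f} years to break even")
print(f"Large company: about {large_company:.1f} years to break even")
```

With these made-up numbers, the small team waits roughly two decades to break even, while the large company gets there in under six months. Same tool, very different answer.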
Iceberg Is Not A Strategy
If your data team isn’t driving value or engaging with the business, switching to Iceberg, or dbt or Presto, for that matter, won’t magically fix your problems.
Writing SQL faster won’t save a broken data strategy. Swapping one tool for another won’t either.
Actually, I want you to pause. Close your eyes.
Let’s say you build your dream architecture. Iceberg is humming along at the center of it all. Now what?
How are you delivering value to the business? If all technical friction disappeared tomorrow, what would you actually do?
It’s like asking: if you suddenly had $100 million, how would you spend your free time? Most people don’t have a clear answer.
And if you don’t, maybe you’ll just chase the next shiny tool.
On the flip side, if your data team is already delivering value and you’re looking to improve workflows, Iceberg might be worth exploring. It could absolutely have a place in a more mature stack.
But that’s not the same as saying it is the strategy. Because tools don’t solve business problems, people do.
Final Thoughts
As technophiles, it’s hard to escape the FOMO.
Netflix is using a new technology, shouldn’t we be using it, too?
Our favorite tool just released a new design paradigm, better dive in and implement it immediately, right? Surely, it’s been battle-tested. But after the migration, you usually find there’s a trade-off. There always is.
Maybe you ditched Teradata for Hadoop and saved on licensing, only to spend that or more hiring more engineers to manage the complexity.
That’s why, as a data engineer or architect, it’s worth stepping back and asking the harder question: Does this tool actually benefit my company? Or am I just chasing infrastructure for infrastructure’s sake?
As always, thanks for reading.
Join My Data Engineering And Data Science Discord
If you’re looking to talk more about data engineering, data science, and breaking into your first job, or to find other like-minded data specialists, then you should join the Seattle Data Guy Discord!
We are now over 8000 members!
Join My Technical Consultants Community
If you’re a data consultant or considering becoming one, then you should join the Technical Freelancer Community! We have over 1500 members!
You’ll find plenty of free resources to expedite your journey as a technical consultant, and you’ll be able to talk to other consultants about any questions you have!
Articles Worth Reading
There are thousands of new articles posted daily all over the web! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Petabyte-scale data migration made simple: AppsFlyer’s best practice journey with Amazon EMR Serverless
Organizations worldwide aim to harness the power of data to drive smarter, more informed decision-making by embedding data at the core of their processes. Using data-driven insights enables you to respond more effectively to unexpected challenges, foster innovation, and deliver enhanced experiences to your customers. In fact, data has transformed how organizations drive decision-making, but historically, managing the infrastructure to support it posed significant challenges and required specific skill sets and dedicated personnel. The complexity of setting up, scaling, and maintaining large-scale data systems impacted agility and pace of innovation. This reliance on experts and intricate setups often diverted resources from innovation, slowed time-to-market, and hindered the ability to respond to changes in industry demands.
Data Sharing in the Real World: Why SFTP Remains Essential for Companies
Every day, companies share data sets of users, patient claims, financial transactions, and more with each other.
Most people might assume this would be via API. However, companies have been sharing data for decades using CSVs, TSVs, positional files, and other formats you might not be familiar with. Not via API, but SFTP.
I know I wouldn’t have guessed that’s how companies send data back and forth when I was in school.
End Of Day 181
Thanks for checking out our community. We put out 4-5 Newsletters a month discussing data, tech, and start-ups.
If you enjoyed it, consider liking, sharing and helping this newsletter grow!