Storing data for analytics.
Why do we even have to do it? Why do we spend so much time just moving data from point A to point B?
Why was one of the largest software IPOs a cloud data warehouse (sorry, cloud data platform)?
How we store data for analytics matters. It impacts performance, the ability to join data across different domains, and the ability to track historical information.
For a long time, everything was done on-prem. Eventually, rising costs and the constant need to migrate to larger server instances sent data teams looking for alternatives. Over the last few years we have seen several transformations in how we process and manage data.
Much of that has occurred in the last two decades as companies have continued to search for the optimal analytics data storage setup.
Let’s dig into the past few decades and see how data storage for analytics got us to where we are.
The Past
To understand how we got to where we are on this journey of data storage for analytics, let's start with some of the earlier data warehousing solutions.
Don’t worry, I won’t go too far back.
Let’s talk SAP, Vertica, and Teradata. Honestly, there are a whole lot more.
These were on-prem solutions. Don't knock them just for being on-prem. These data warehouses have some serious firepower.
Teradata was managing 1 terabyte (a trillion bytes) at Wal-Mart back in 1992. That's 30 years ago. These data warehouses could handle serious workloads, and in many cases have adapted to process the volumes of data companies deal with today.
However, there were plenty of issues with some of these systems, especially since they were on-prem and often locked into a fixed server size. Between licensing and employee costs, they quickly became expensive as you needed to manage more and more data.
Also, as data started to explode, they struggled to keep up with the ever-expanding storage and compute required. Perhaps you didn't need a larger instance or server when your company started, but 18 months down the line you would need to migrate.
So do you pay for a migration, or do you pay upfront for a larger server than you need? Both are bad for your bottom line.
Enter the move to cheaper hardware and the cloud.
The Dark Ages - Cheap Hardware And The Cloud
When I first started in the data world I worked on Oracle and Microsoft SQL Server based data warehouses. But while I was working on those products, Redshift and Hadoop were all the rage.
Redshift was the first widely adopted cloud-based data warehouse, and Hadoop proved to be a "cheap" way to process large amounts of data on commodity hardware.
I am being a little harsh by calling this the dark ages. These technologies were the stepping stones we used to get where we are now. Hive, Presto, Trino, and Spark all build on the initial steps we took in this era.
Also, Redshift remains the leader in market share for cloud-based data warehouses. These tools built what we know as modern data warehousing and data lakes.
However, both Redshift and Hadoop proved to be painful to manage.
Redshift wasn’t developed with classic data warehouse developers in mind. The fact that a traditional MERGE statement wasn’t even supported caused major issues. What if you wanted to update existing rows and insert new ones in a single statement?
Well, they created the “Upsert”. When I first heard this I thought, “Awesome! And it’s like a merge clause, right?”
Wrong.
Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert) to insert and update data from a single data source. However, you can effectively perform a merge operation. To do so, load your data into a staging table and then join the staging table with your target table for an UPDATE statement and an INSERT statement. For instructions, see Updating and inserting new data. - Amazon Docs
So even running a basic merge required extra objects and steps for what would have been a standard operation in older data warehouses. There were plenty of other issues as well.
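For illustration, here's a rough sketch of what that staging-table upsert pattern looks like. The table and column names (sales, sales_staging, id, amount) are hypothetical, and the step that actually loads the staging table would depend on your pipeline (typically a COPY from S3):

```sql
BEGIN TRANSACTION;

-- 1. Update target rows that already exist, using the freshly loaded staging table.
UPDATE sales
SET amount = s.amount,
    updated_at = s.updated_at
FROM sales_staging s
WHERE sales.id = s.id;

-- 2. Insert staging rows that don't exist in the target yet (anti-join).
INSERT INTO sales
SELECT s.*
FROM sales_staging s
LEFT JOIN sales t ON s.id = t.id
WHERE t.id IS NULL;

END TRANSACTION;

-- 3. Clear out the staging table before the next load.
TRUNCATE sales_staging;
```

In a traditional warehouse like Oracle, SQL Server, or Teradata, all of that would be a single MERGE statement.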
The Hadoop side wasn’t that much better.
On the Hadoop side of the world we threw database knowledge out the window for 15 years - Ryan Blue (Co-Creator of Apache Iceberg)
Hadoop didn’t provide a SQL interface (right away) and was hard to manage. Once it did provide a SQL interface, it had similar issues when it came to updates and running ALTER statements on columns.
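As a rough example (the orders table here is hypothetical): early Hive had no row-level UPDATE at all, and even after it was added in Hive 0.14, it only worked on tables explicitly created as transactional ACID tables, something like:

```sql
-- Row-level updates only work on transactional (ACID) tables, which in
-- earlier Hive versions also had to be bucketed ORC tables, with the
-- right transaction-manager settings enabled on the cluster.
CREATE TABLE orders (
    id INT,
    status STRING
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Fails on a plain text/external table, works on the ACID table above.
UPDATE orders SET status = 'shipped' WHERE id = 42;
```

ALTER statements had similar caveats, since they typically only touched the metastore and not the data files already sitting on HDFS.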
So even though the tool itself was open source, your team could spend a lot of resources just to manage all of the various components required to operate Hadoop.
Overall, these tools did lay the foundation for the next phase of cloud data warehouses and data lakes.