AWS has jumped on the bandwagon of removing the need for ETLs. Snowflake announced a similar direction, both with its hybrid tables and its partnership with Salesforce.
Now, I do take a little issue with the name “Zero ETL,” because on the surface the functionality described is often closer to a zero-integration future, which probably doesn’t come across as ‘sexy’ enough. This may also only be phase one of AWS and Snowflake’s plan to remove the need for ETLs.
Overall, I do agree with the idea of reducing the amount of duplicate logic and data that exists. So if there is a path that leads to a zero ETL world, we should make it happen.
But what would it take?
In this article, I will walk through what a Zero ETL future might look like and how we might get there.
The Problem Of Perception
When I read or hear about announcements like this, I assume there are undiscussed nuances. But I have found when the business reads these types of announcements, they take them at face value.
They come back and tell their team, “We want to move to this no-code, zero ETL, serverless future.” It all sounds good from a business perspective: costs will be reduced, headcount can be slashed, and value from data can be gained immediately.
But this skips over all the other nuances that are unavoidable.
Before diving into a Zero ETL future, let’s review some of the reasons ETLs exist.
Why We ETL
Simply duplicating data from point A to point B is not an ETL. If that’s all we needed to do, we could just create replica databases and report off of those. So why create complex systems and use tools like Airflow or Prefect at all?
Why hire expensive data engineers?
Why even ETL?
Historical Data - Generally speaking, most operational databases don’t track historical data. Specifically, they don’t track historically changing entity data, like where a customer lives.
So when data gets updated or deleted, if you’re not using CDC (change data capture), you’re going to lose information. This is why the concept of slowly changing dimensions exists: so that if a customer moves states or an employee changes jobs, we can accurately reflect that over time.
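To make that concrete, here is a minimal sketch of one common way to model a slowly changing dimension (often called Type 2), where an update closes out the old row instead of overwriting it. The `CustomerVersion` record, the `apply_change` and `state_on` helpers, and the sample dates are all hypothetical names I made up for illustration, not any particular tool's API.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    """One row of a hypothetical Type 2 customer dimension."""
    customer_id: int
    state: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(history: List[CustomerVersion], customer_id: int,
                 new_state: str, as_of: date) -> None:
    """Record a change by closing the current row and appending a new one."""
    current = next(
        (row for row in history
         if row.customer_id == customer_id and row.valid_to is None),
        None,
    )
    if current is not None:
        if current.state == new_state:
            return  # nothing changed, so keep the existing version open
        current.valid_to = as_of  # close the old version instead of overwriting it
    history.append(CustomerVersion(customer_id, new_state, as_of))


def state_on(history: List[CustomerVersion], customer_id: int,
             day: date) -> Optional[str]:
    """Return the state the customer lived in on a given day, if known."""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.state
    return None


history: List[CustomerVersion] = []
apply_change(history, 1, "WA", date(2022, 1, 1))
apply_change(history, 1, "CA", date(2023, 6, 1))  # the customer moves states
```

An in-place `UPDATE` on the operational table would have left only "CA" behind. Here both versions survive, so a question like "where did this customer live at the end of 2022?" can still be answered with `state_on(history, 1, date(2022, 12, 31))`.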
Now don’t get me wrong. What you could do instead of a traditional SCD (slowly changing dimension) is simply partition a table by date and keep loading data into it, so the table is ever expanding. We did this from time to time at Facebook because it’s a simpler design.
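A rough sketch of the date-partition approach might look like the following. It uses an in-memory SQLite database purely so the example is runnable. SQLite has no real partitions, so a `ds` column stands in for the partition key, and the table names and the `snapshot_customers` helper are hypothetical. This is not how any specific warehouse or company implements it, just the general shape of the idea.

```python
import sqlite3
from typing import Optional


def snapshot_customers(conn: sqlite3.Connection, ds: str) -> None:
    """Copy the full operational table into the snapshot table under date ds."""
    # Clear the date first so rerunning the same day does not duplicate rows.
    conn.execute("DELETE FROM customer_snapshot WHERE ds = ?", (ds,))
    conn.execute(
        "INSERT INTO customer_snapshot (ds, customer_id, state) "
        "SELECT ?, customer_id, state FROM customers",
        (ds,),
    )
    conn.commit()


def state_in_snapshot(conn: sqlite3.Connection, customer_id: int,
                      ds: str) -> Optional[str]:
    """Look up a customer's state as captured on a given snapshot date."""
    row = conn.execute(
        "SELECT state FROM customer_snapshot WHERE ds = ? AND customer_id = ?",
        (ds, customer_id),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
# Stand-in for the operational table, which only holds the latest values.
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, state TEXT)")
# Ever-expanding snapshot table keyed by date.
conn.execute(
    "CREATE TABLE customer_snapshot ("
    "ds TEXT, customer_id INTEGER, state TEXT, PRIMARY KEY (ds, customer_id))"
)

conn.execute("INSERT INTO customers VALUES (1, 'WA')")
snapshot_customers(conn, "2023-01-01")

# The operational row is overwritten, but the earlier snapshot keeps the old value.
conn.execute("UPDATE customers SET state = 'CA' WHERE customer_id = 1")
snapshot_customers(conn, "2023-01-02")
```

The trade-off is storage for simplicity: every date holds a full copy of the table, but there is no merge logic to maintain, and answering "what did this look like on a given day?" is a plain filter on `ds`.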