Data Modeling: Where Theory Meets Reality
How Different Companies I Worked At Modeled Their Data
Data modeling varies from company to company.
Even companies that all say they use dimensional modeling, or at least use terms like facts and dims, might differ in how they implement it.
Over the past decade or so, I have worked for and with companies that used a variety of methods to capture their data. I wanted to review some of the techniques that are commonly used to model data for analytics.
This is part of my unofficial series on data modeling, so if you’d like to learn more, then you can check out some of the prior articles such as this one.
But for now, let’s dive in.
Tracking Changing Dimensions
If you’ve queried a data warehouse before, there is a good chance you’ve come across slowly changing dimensions. Perhaps you didn’t realize it, but if you saw some combination of columns like effective_date, start_date, and end_date, that’s likely what you were seeing.
Now, in the wild, I have mostly seen SCD (slowly changing dimension) Type 2 and some hybrid versions of Type 6. But mostly Type 2.
Type 2 inserts a new row for each change and captures when those changes occur.
The common example I use to explain how SCD Type 2 works is if you want to report on the number of days a person works in a specific role.
From a business standpoint you might want to answer how long it takes for different teams to promote an individual. In order to do so, you need to know when a person changed from one level or role to another.
So you might store your data like the one below:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e4f2b6-efaf-4696-81f0-68069e7e19c0_1857x592.png)
This captures when certain changes happen and how long they last. This means you can safely report on the question we posed earlier (how many days on average it takes for an employee to switch roles).
There is another type of SCD called Type 1, which simply updates the information by replacing the old values. The problem here is that you lose the prior information, meaning you can no longer report on it accurately. In our example above, you’d just replace John’s job_title with data engineer and perhaps change an update_date column (if one exists). But you lose the fact that John was a data analyst.
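As a minimal sketch of the difference, here is what a Type 2 update looks like on a simple in-memory table (the column names and dates are illustrative, not from any specific system):

```python
from datetime import date

# Hypothetical employee dimension; column names are illustrative
dim = [
    {"employee": "John", "job_title": "data analyst",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd_type2_update(rows, employee, new_title, change_date):
    """Close out the current row and insert a new one, preserving history."""
    for row in rows:
        if row["employee"] == employee and row["is_current"]:
            row["end_date"] = change_date  # record when the old role ended
            row["is_current"] = False
    rows.append({"employee": employee, "job_title": new_title,
                 "start_date": change_date, "end_date": None, "is_current": True})

scd_type2_update(dim, "John", "data engineer", date(2022, 6, 1))

# Both roles survive, so days-in-role can still be computed
assert [r["job_title"] for r in dim] == ["data analyst", "data engineer"]
```

A Type 1 update would instead overwrite `job_title` in place, which is why the analyst-to-engineer transition would be unrecoverable.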
In turn, that’s why I’ve mostly come across SCD type 2. But it’s not the only way to capture data that changes.
Facebook And Date Partitions
When I first started working at Facebook, one of the differences I noted was that, unlike most other companies I had worked at, they tended to take snapshots of dimensional information every day (honestly, they also used the term “dimension” loosely). In fact, I have noticed that many tech-focused organizations reference something similar. But not everyone is sure if they are bought in.
For those new to the idea of daily partitions, you can imagine it looked like the image below.
This does mean that you need to join tables on their IDs and ds columns. So if you wanted to know where a customer or an employee lived, or what job they had, when they clicked a page or bought a product, you could. But you have to make sure you have a WHERE clause that filters for the date you’re interested in.
So instead of dealing with the traditional slowly changing dimension approach, where you’d have to use a “BETWEEN” clause, you could just join on ds or dt, the common names for that column.
This method has its challenges; in fact, we occasionally still had to create tables with a more traditional SCD Type 2 approach. But most use cases worked pretty well using the partition method.
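To make the join concrete, here is a minimal sketch of joining an event to a daily-snapshot dimension on (id, ds) rather than a date range (the tables and values are hypothetical):

```python
# Hypothetical daily-snapshot dimension: one row per user per ds (partition date)
dim_snapshots = [
    {"user_id": 1, "ds": "2024-01-01", "city": "Seattle"},
    {"user_id": 1, "ds": "2024-01-02", "city": "Portland"},
]

# Hypothetical event fact rows, partitioned by the same ds
events = [
    {"user_id": 1, "ds": "2024-01-02", "event": "purchase"},
]

# Equi-join on (user_id, ds) instead of a BETWEEN over effective dates
dim_lookup = {(d["user_id"], d["ds"]): d for d in dim_snapshots}
enriched = [{**e, "city": dim_lookup[(e["user_id"], e["ds"])]["city"]}
            for e in events]

# The event picks up the user's attributes as of that day
assert enriched[0]["city"] == "Portland"
```

In SQL terms, this is just `JOIN ... ON a.user_id = b.user_id AND a.ds = b.ds`, with a WHERE clause pinning the partition date you care about.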
Tracking Events In Big Tech
When it comes to handling event data, you might assume it’s not that hard.
After all, it's a simple situation. For every new event, simply insert the data, right?
Maybe you occasionally have to delete some of the prior data and then re-insert it, but there is far less need for updates and merges.
That all works great until your data gets so large it takes too long to compute or maybe you’re just looking to improve the performance of some dashboards. Then suddenly, you need to rethink how you manage event data.
This has been discussed in several other pieces of content, but data structures such as date lists in a single field are an excellent way of reducing processing costs (the trade-off often being that you lose granularity).
I really enjoyed the piece from Connor Skennerton, where he reviewed a talk given by Roblox engineers who were using a datelist for that purpose. Here is a quick snippet:
In his talk, Yan and William gave the example of scanning over a raw fact table that had 10 TB of data generated per day. Scanning all the historical data every time a query was run required looking at petabytes of data. In comparison, using datelist table they only needed to scan through 10TB + ~0.5TB per day. - Datelist tables at Roblox Data Engineering Meetup
But what does this look like in real life?
Let’s go over the example below. As you can see, this could be viewed as a standard page-visit-style table, where each visit to a URL is captured with a user_id and when the event occurred (you’d also likely capture other bits of information, such as time spent on page, an event ID, etc.). For now, let’s focus on how this data can be reshaped.
You can take that lower-level data and roll it up into a table that aggregates the number of page views by day. After all, many high-level dashboards don’t need all the information, such as exactly which pages were viewed.
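That first roll-up can be sketched in a few lines; assuming a hypothetical page-visit table, the aggregation drops the URLs and keeps only counts per user per day:

```python
from collections import Counter

# Hypothetical raw page-visit rows
visits = [
    {"user_id": 1, "url": "/home", "ds": "2024-01-01"},
    {"user_id": 1, "url": "/pricing", "ds": "2024-01-01"},
    {"user_id": 2, "url": "/home", "ds": "2024-01-01"},
]

# Aggregate to (user_id, ds) -> page view count; the specific URLs are dropped
daily_views = Counter((v["user_id"], v["ds"]) for v in visits)

assert daily_views[(1, "2024-01-01")] == 2
```

That's the granularity trade-off in action: the daily table is far smaller, but you can no longer answer "which pages?" from it.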
Now, if you wanted to take this one step further, you could use a date list to aggregate all that information at the user level, so you only have one row per user. As referenced above, this becomes very useful on very large data sets because you only need to look at the prior date partition and whichever users were active in the new date partition, then merge the two. That means you’ll never have to pull more historical rows than you have users. So instead of dealing with tens if not hundreds of billions of rows, or, as referenced above, petabytes, you can drastically reduce the number, and in turn, how much data actually needs to be processed.
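A minimal sketch of that incremental merge, using a dict as a stand-in for the datelist table (user names and dates are made up):

```python
# Hypothetical datelist table as of yesterday: one row per user,
# holding the list of dates each user was active
datelist = {
    "user_a": ["2024-01-01", "2024-01-02"],
    "user_b": ["2024-01-01"],
}

# Today's partition: only the users active today, not the full history
todays_active = {"user_b", "user_c"}
today = "2024-01-03"

# Incremental merge: we touch yesterday's one-row-per-user table plus
# today's partition, never the raw event history
for user in sorted(todays_active):
    datelist.setdefault(user, []).append(today)

assert datelist["user_b"] == ["2024-01-01", "2024-01-03"]
```

Retention or monthly-active-user metrics can then be computed by scanning each user's date list, rather than rescanning billions of raw event rows.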
A similar concept was shown in a video shared not long ago with the title, Data modeling a 100 TB data lake into 5 TBs with STRUCT and Array. In my experience, this method of modeling event data fits large tech companies rather well, both because of the metrics they focus on, such as retention and monthly active users, and because of the data sizes they are dealing with.
I haven’t seen this specific data modeling technique used outside of tech organizations, or at the very least outside of reporting on retention and other engagement metrics.
Overall, it fits this specific use case rather well. I would love to know if anyone has seen it used elsewhere!
Fact Data - At Most Companies
Now taking a step back from aggregating and reporting, let’s look at just fact data itself.
Again, it should be easy, new data comes in, you insert it.
Done.
But here is the gotcha.
Not all events and fact data are page clicks. Some are bills for your visit to the hospital; others are orders where people return products they purchased.
How operational systems handle those cases can be very different. Even if how you store the data is straightforward, how you process it could vary.
For example, perhaps the system you’re pulling from sends a correction line (e.g., if you spent $50 on a product but returned it, it sends a -$50 in a new row with the same ID). You might also have to pull this information from a completely different table labeled returns.
Another example could be healthcare claims, where the price of a service might be disputed. In order to correct it, you likely have to either replace the data with a new row for that claim or add a correction row. This is shown below, where claim_id 5 needs a -$50 correction.
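Here's a minimal sketch of the correction-row pattern: reporting sums all rows per claim_id, so corrections net out (the claim IDs and amounts are illustrative):

```python
from collections import defaultdict

# Hypothetical claim fact rows; corrections arrive as new rows
# carrying the same claim_id with a negative amount
claims = [
    {"claim_id": 5, "amount": 150},
    {"claim_id": 5, "amount": -50},  # correction row
    {"claim_id": 6, "amount": 200},
]

# Net amount per claim: sum every row sharing a claim_id
totals = defaultdict(int)
for row in claims:
    totals[row["claim_id"]] += row["amount"]

assert totals[5] == 100  # the corrected claim amount
```

The alternative approach mentioned above, replacing the claim's row outright, avoids the SUM but requires an update or merge step instead of a pure insert.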
But in order to know how to set up your data flows, you need to understand how the data itself is created and stored. So making sure you go to the source and understand how it operates is critical when you model your data (of course, you should also talk to the business and understand that side of the data model).
Overall, there are so many ways to store and process historical data to ensure you capture information and can report on it accurately.
This Article Is History
Now, as I publish this article, its creation gets put into a data warehouse somewhere, along with all your clicks. I hope you now have a better understanding of how various companies model their data.
Everything is a trade-off and you want to make sure you pick the right solution for the right job!
With that, I want to say thanks for reading!
If your team is looking for an advisor to help improve their data strategy or data infrastructure, then feel free to sign up for a consultation.
Join My Data Engineering And Data Science Discord
If you’re looking to talk more about data engineering, data science, breaking into your first job, and finding other like-minded data specialists, then you should join the Seattle Data Guy Discord!
We are close to passing 7000 members!
Join My Data Consultants Community
If you’re a data consultant or considering becoming one then you should join the Technical Freelancer Community! I recently opened up a few sections to non-paying members so you can learn more about how to land clients, different types of projects you can run, and more!
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
How I build an ETL pipeline with AWS Glue, Lambda, and Terraform
This month, I decided to work on building an ETL pipeline using some well-known AWS Service.
However, I couldn’t do this pipeline using the AWS console.
Even though the AWS console is easy to use and presents advantages, it is more suitable for..
End Of Day 128
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
I've been using something like the date stacking in a project of mine right now. Our warehouse takes a snapshot of Salesforce data every day, so every opportunity exists on every snapshot day.
To make that data more usable in Tableau, we decided "ok, we really only need the granularity of Opportunity ID in the current quarter. For all prior quarters, we can aggregate the data up to a 'dimensional key'."
The dimensional key is Snapshot Date + Dimensions (geography, sales segment, stage, etc.). I guess you couldn't call this date stacking in its pure form? But it cut down the size of the data source by over 70%, which makes Tableau... compliant.
Good one!
I have mostly worked on the one big struct/array table, as mentioned by Zach. I believe that will be the future, especially since upstream models are now super easy to do with event-driven systems, where everything is stored as a log.
Also, modern DWHs now support very fast access to nested fields.