Data Modeling: Where Theory Meets Reality
How Different Companies I Worked At Modeled Their Data
Data modeling varies from company to company.
Even companies that all say they use dimensional modeling, or at least use terms like facts and dims, might differ in how they implement it.
Over the past decade or so, I have worked for and with companies that used a variety of methods to capture their data. I wanted to review some of the techniques that are commonly used to model data for analytics.
This is part of my unofficial series on data modeling, so if you’d like to learn more, then you can check out some of the prior articles such as this one.
But for now, let’s dive in.
Tracking Changing Dimensions
If you’ve queried a data warehouse before, there is a good chance you’ve come across slowly changing dimensions. Perhaps you didn’t realize it, but if you saw some combination of columns like effective_date, start_date, and end_date, that’s likely what you were seeing.
Now, in the wild, I have mostly seen SCD (slowly changing dimension) Type 2 and some hybrid versions of Type 6. But mostly Type 2.
Type 2 inserts a new row for each change and captures when those changes occur.
The common example I use to explain how SCD Type 2 works is if you want to report on the number of days a person works in a specific role.
From a business standpoint you might want to answer how long it takes for different teams to promote an individual. In order to do so, you need to know when a person changed from one level or role to another.
So you might store your data like the one below:
![](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1e4f2b6-efaf-4696-81f0-68069e7e19c0_1857x592.png)
This captures when certain changes happen and how long they last. This means you can safely report on the question we posed earlier (how many days on average it takes for an employee to switch roles).
There is another type of SCD called Type 1, which simply updates the information by replacing the old values. The problem here is that you lose the prior information, meaning you can no longer report on it accurately. In our example above, you’d just replace John’s job_title with data engineer and perhaps change an update_date column (if one exists). But you lose the fact that John was a data analyst.
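As a minimal sketch of the difference, here is what a Type 2 update looks like on a simple in-memory table (the column names and dates are illustrative, not from any specific system):

```python
from datetime import date

# Hypothetical employee dimension; column names are illustrative
dim = [
    {"employee": "John", "job_title": "data analyst",
     "start_date": date(2020, 1, 1), "end_date": None, "is_current": True},
]

def scd_type2_update(rows, employee, new_title, change_date):
    """Close out the current row and insert a new one, preserving history."""
    for row in rows:
        if row["employee"] == employee and row["is_current"]:
            row["end_date"] = change_date  # record when the old role ended
            row["is_current"] = False
    rows.append({"employee": employee, "job_title": new_title,
                 "start_date": change_date, "end_date": None, "is_current": True})

scd_type2_update(dim, "John", "data engineer", date(2022, 6, 1))

# Both roles survive, so days-in-role can still be computed
assert [r["job_title"] for r in dim] == ["data analyst", "data engineer"]
```

A Type 1 update would instead overwrite `job_title` in place, which is why the analyst-to-engineer transition would be unrecoverable.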
In turn, that’s why I’ve mostly come across SCD type 2. But it’s not the only way to capture data that changes.
Facebook And Date Partitions
When I first started working at Facebook, one of the differences I noted was that, unlike most other companies I had worked at, they tended to take snapshots of dimensional information every day (honestly, they also used the term “dimension” loosely). In fact, I have noticed that many tech-focused organizations reference something similar. But not everyone is sure if they are bought in.
For those new to the idea of daily partitions, you can imagine it looked like the image below.
This does mean that you need to join tables on their IDs and ds columns. So if you wanted to know where a customer or an employee lived, or what job they had, when they clicked a page or bought a product, you could. But you have to make sure you have a WHERE clause that filters for the date you’re interested in.
So instead of dealing with the traditional slowly changing dimension approach, where you’d have to use a “BETWEEN” clause, you could just join on ds or dt, the common names for that column.
This method has its challenges; in fact, we occasionally still had to create tables with a more traditional SCD Type 2 approach. But most use cases worked pretty well using the partition method.
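To make the join concrete, here is a minimal sketch of joining an event to a daily-snapshot dimension on (id, ds) rather than a date range (the tables and values are hypothetical):

```python
# Hypothetical daily-snapshot dimension: one row per user per ds (partition date)
dim_snapshots = [
    {"user_id": 1, "ds": "2024-01-01", "city": "Seattle"},
    {"user_id": 1, "ds": "2024-01-02", "city": "Portland"},
]

# Hypothetical event fact rows, partitioned by the same ds
events = [
    {"user_id": 1, "ds": "2024-01-02", "event": "purchase"},
]

# Equi-join on (user_id, ds) instead of a BETWEEN over effective dates
dim_lookup = {(d["user_id"], d["ds"]): d for d in dim_snapshots}
enriched = [{**e, "city": dim_lookup[(e["user_id"], e["ds"])]["city"]}
            for e in events]

# The event picks up the user's attributes as of that day
assert enriched[0]["city"] == "Portland"
```

In SQL terms, this is just `JOIN ... ON a.user_id = b.user_id AND a.ds = b.ds`, with a WHERE clause pinning the partition date you care about.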
Tracking Events In Big Tech
When it comes to handling event data, you might assume it’s not that hard.
After all, it's a simple situation. For every new event, simply insert the data, right?
Maybe you occasionally have to delete some of the prior data and then re-insert it, but there is far less need for updates and merges.
That all works great until your data gets so large it takes too long to compute or maybe you’re just looking to improve the performance of some dashboards. Then suddenly, you need to rethink how you manage event data.
This has been discussed in several other pieces of content, but data structures such as date lists in a single field are an excellent way of reducing processing costs (the trade-off often being that you lose granularity).
I really enjoyed the piece from Connor Skennerton, where he reviewed a talk given by Roblox engineers who were using a datelist for that purpose. Here is a quick snippet:
In his talk, Yan and William gave the example of scanning over a raw fact table that had 10 TB of data generated per day. Scanning all the historical data every time a query was run required looking at petabytes of data. In comparison, using datelist table they only needed to scan through 10TB + ~0.5TB per day. - Datelist tables at Roblox Data Engineering Meetup
But what does this look like in real life?
Let’s go over the example below. As you can see, this could be viewed as a standard page-visit-style table, where each visit to a URL is captured with a user_id and when the event occurred (you’d also likely capture other bits of information, such as time spent on page, an event ID, etc.). For now, let’s focus on how this data can be reshaped.
You can take that lower-level data and roll it up into a table that aggregates the number of page views by day. After all, many high-level dashboards don’t need all the information, such as exactly which pages were viewed.
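That first roll-up can be sketched in a few lines; assuming a hypothetical page-visit table, the aggregation drops the URLs and keeps only counts per user per day:

```python
from collections import Counter

# Hypothetical raw page-visit rows
visits = [
    {"user_id": 1, "url": "/home", "ds": "2024-01-01"},
    {"user_id": 1, "url": "/pricing", "ds": "2024-01-01"},
    {"user_id": 2, "url": "/home", "ds": "2024-01-01"},
]

# Aggregate to (user_id, ds) -> page view count; the specific URLs are dropped
daily_views = Counter((v["user_id"], v["ds"]) for v in visits)

assert daily_views[(1, "2024-01-01")] == 2
```

That's the granularity trade-off in action: the daily table is far smaller, but you can no longer answer "which pages?" from it.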
Now, if you wanted to take this one step further, you could use a date list to aggregate all that information at the user level, so you only have one row per user. As referenced above, this becomes very useful on very large data sets because you only need to look at the prior date partition and whichever users were active in the new date partition, then merge the two. That means you’ll never have to pull more historical rows than you have users. So instead of dealing with tens if not hundreds of billions of rows, or, as referenced above, petabytes, you can drastically reduce the number, and in turn, how much data actually needs to be processed.
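A minimal sketch of that incremental merge, using a dict as a stand-in for the datelist table (user names and dates are made up):

```python
# Hypothetical datelist table as of yesterday: one row per user,
# holding the list of dates each user was active
datelist = {
    "user_a": ["2024-01-01", "2024-01-02"],
    "user_b": ["2024-01-01"],
}

# Today's partition: only the users active today, not the full history
todays_active = {"user_b", "user_c"}
today = "2024-01-03"

# Incremental merge: we touch yesterday's one-row-per-user table plus
# today's partition, never the raw event history
for user in sorted(todays_active):
    datelist.setdefault(user, []).append(today)

assert datelist["user_b"] == ["2024-01-01", "2024-01-03"]
```

Retention or monthly-active-user metrics can then be computed by scanning each user's date list, rather than rescanning billions of raw event rows.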
A similar concept was shown in a video shared not long ago with the title, Data modeling a 100 TB data lake into 5 TBs with STRUCT and Array. In my experience, this method of modeling event data fits large tech companies rather well, both because of the metrics they focus on, such as retention and monthly active users, and because of the data sizes they are dealing with.
I haven’t seen this specific data modeling technique used outside of tech organizations, or at the very least outside of reporting on retention and other engagement metrics.
Overall, it fits this specific use case rather well. I would love to know if anyone has seen it used elsewhere!
Fact Data - At Most Companies
Now taking a step back from aggregating and reporting, let’s look at just fact data itself.
Again, it should be easy, new data comes in, you insert it.
Done.
But here is the gotcha.
Not all events and fact data are page clicks. Some are bills for your visit to the hospital; others are orders where people return products they purchased.
How operational systems handle those cases can be very different. Even if how you store the data is straightforward, how you process it could vary.
For example, perhaps the system you’re pulling from sends a correction line (e.g., if you spent $50 on a product but returned it, it sends a -$50 in a new row with the same ID). You might also have to pull this information from a completely different table labeled returns.
Another example could be healthcare claims, where the price of a service might be disputed. In order to correct it, you likely have to either replace the data with a new row for that claim or add a correction row. This is shown below, where claim_id 5 needs a -$50 correction.
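Here's a minimal sketch of the correction-row pattern: reporting sums all rows per claim_id, so corrections net out (the claim IDs and amounts are illustrative):

```python
from collections import defaultdict

# Hypothetical claim fact rows; corrections arrive as new rows
# carrying the same claim_id with a negative amount
claims = [
    {"claim_id": 5, "amount": 150},
    {"claim_id": 5, "amount": -50},  # correction row
    {"claim_id": 6, "amount": 200},
]

# Net amount per claim: sum every row sharing a claim_id
totals = defaultdict(int)
for row in claims:
    totals[row["claim_id"]] += row["amount"]

assert totals[5] == 100  # the corrected claim amount
```

The alternative approach mentioned above, replacing the claim's row outright, avoids the SUM but requires an update or merge step instead of a pure insert.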
But in order to know how to set up your data flows, you need to understand how the data itself is created and stored. So making sure you go to the source and understand how it operates is critical when you model your data (of course, you should also talk to the business and understand that side of the data model).
Overall, there are so many ways to store and process historical data to ensure you capture information and can report on it accurately.
This Article Is History
Now, as I publish this article, its creation gets put into a data warehouse somewhere, along with all your clicks. I hope you now have a better understanding of how various companies model their data.
Everything is a trade-off and you want to make sure you pick the right solution for the right job!
With that, I want to say thanks for reading!
If your team is looking for an advisor to help improve their data strategy or data infrastructure, then feel free to sign up for a consultation.
Join My Data Engineering And Data Science Discord
If you’re looking to talk more about data engineering, data science, breaking into your first job, and finding other like-minded data specialists, then you should join the Seattle Data Guy Discord!
We are close to passing 7000 members!
Join My Data Consultants Community
If you’re a data consultant or considering becoming one then you should join the Technical Freelancer Community! I recently opened up a few sections to non-paying members so you can learn more about how to land clients, different types of projects you can run, and more!
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
How I build an ETL pipeline with AWS Glue, Lambda, and Terraform
This month, I decided to work on building an ETL pipeline using some well-known AWS Service.
However, I couldn’t do this pipeline using the AWS console.
Even though the AWS console is easy to use and presents advantages, it is more suitable for..
End Of Day 128
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
I've been using something like the date stacking in a project of mine right now. Our warehouse takes a snapshot of Salesforce data every day, so every opportunity exists on every snapshot day.
To make that data more usable in Tableau, we decided "ok, we really only need the granularity of Opportunity ID in the current quarter. For all prior quarters, we can aggregate the data up to a 'dimensional key'."
The dimensional key is Snapshot Date + Dimensions (geography, sales segment, stage, etc.). I guess you couldn't call this date stacking in its pure form? But it cut down the size of the data source by over 70%, which makes Tableau... compliant.
Good one!
I have mostly worked on the one big struct/array table, as mentioned by Zach. I believe that will be the future, especially since upstream models are now super easy to do with event-driven systems, where everything is stored as a log.
Also, modern DWHs now support very fast access to nested fields.