The Baseline Data Stack - The Different Types Of Data Stacks - Part 2

From Open Source To Paid Solutions

Jan 26, 2022


When building your data infrastructure, you need to pick the right tools and the right overall setup. There are a lot of different best practices, cloud service providers, and general solutions to choose from.

So which type of data stack will you go with?

Once you start categorizing data stacks, there are plenty of options.

Your data stack might just be a replicant database, where you copy your core database and analyze the data with whatever tool works best.

But once you graduate from this data stack you can pick from a whole host of other data stack types. You could go all brand name or go open source.

All of these could be good options depending on your team’s goals and preferences and, in general, how well a salesperson talks up their tool.

So let’s go over some of the common data stacks I have seen.

The Replicant Database

In the last newsletter, I discussed how one way a software engineer might provide data to an analyst is with an Excel/CSV extract. Another way a lot of software engineers provide data for analytics is by simply creating a replicant database.

Just a copy-paste (basically).

This can be done with a very simple script that runs daily. The benefit is that very little engineering needs to occur to get this replicant database.

In turn, your analysts can quickly start building reports and running analyses on the data, meaning your business can get the answers it wants quickly.
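As a rough illustration of how small such a script can be, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its rows, and the `replicate` helper are all hypothetical, and both databases live in memory just to keep the example self-contained. A real setup would point at your actual application database and run on a daily schedule.

```python
import sqlite3


def replicate(source: sqlite3.Connection, replica: sqlite3.Connection) -> None:
    """Copy the full contents of the source database into the replica."""
    # Connection.backup copies every page of the source into the target.
    source.backup(replica)


# Hypothetical application database with a single users table.
app_db = sqlite3.connect(":memory:")
app_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, home_city TEXT)")
app_db.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Seattle"), (2, "Denver")],
)
app_db.commit()

# The replica is what analysts would query instead of the live database.
replica_db = sqlite3.connect(":memory:")
replicate(app_db, replica_db)

rows = replica_db.execute("SELECT id, home_city FROM users ORDER BY id").fetchall()
print(rows)
```

The point is only that the copy itself is a few lines of code. Scheduling, credentials, and the database engine are left out and will differ from team to team.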

If this is such an easy method and can provide a quick ROI, then why don’t all companies just make replicant databases? Well, there are a lot of major drawbacks to this method.

For example, one major issue that can come up depending on how your application database is set up is tracking historical changes.

Let’s say you have a user table and that user table has a field like the home city. Great! Well, to track that information, it’s not uncommon to simply overwrite that field when a user has a new home city, meaning you lose the previous city.

So if a manager asks you to count the number of users per city year over year, you can’t answer this question. At least not without being very wrong…
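To make the problem concrete, here is a small sketch, again with sqlite3 and a hypothetical `users` table. It counts users per city, applies an in-place update for one user who moved, and counts again. Once the update has run, nothing in the table records that the user ever lived anywhere else.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, home_city TEXT)")
db.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Seattle"), (2, "Seattle"), (3, "Denver")],
)


def users_per_city(conn: sqlite3.Connection) -> dict:
    """Count users per home city based on the current state of the table."""
    query = "SELECT home_city, COUNT(*) FROM users GROUP BY home_city"
    return dict(conn.execute(query).fetchall())


# The counts as they looked before anyone moved.
before_move = users_per_city(db)

# User 2 moves, and the application overwrites the old value in place.
db.execute("UPDATE users SET home_city = 'Denver' WHERE id = 2")

# The same query now reflects only the latest state.
after_move = users_per_city(db)

# The previous city for user 2 can no longer be recovered from the table.
current_city = db.execute("SELECT home_city FROM users WHERE id = 2").fetchone()[0]

print(before_move, after_move, current_city)
```

Unless someone saved the earlier counts, the only answer the replicant database can give for last year is this year's answer.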

Another major issue, of course, is a lack of data centralization. One of the major goals of creating an analytical data storage system like a data warehouse is that you can centralize all your data into one place. Meaning that if you want to run a query on data from Facebook ads, your own internal database as well as four other applications, you can. A replicant database only holds a copy of one application’s data.
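As a sketch of what centralization buys you, the example below loads two hypothetical sources into one database: an `ad_spend` table standing in for data pulled from an ads platform, and a `signups` table standing in for data from the internal application. The table names, columns, and numbers are all made up for illustration. Because both live in one place, a single query can join them.

```python
import sqlite3

# One central store holding data that originated in two separate systems.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE ad_spend (campaign TEXT, spend REAL);
    CREATE TABLE signups (user_id INTEGER, campaign TEXT);
    INSERT INTO ad_spend VALUES ('spring_sale', 120.0), ('brand', 50.0);
    INSERT INTO signups VALUES (1, 'spring_sale'), (2, 'spring_sale'), (3, 'brand');
    """
)

# Cost per signup needs both sources, which is only possible once they sit together.
cost_per_signup = warehouse.execute(
    """
    SELECT a.campaign, a.spend / COUNT(s.user_id) AS cost
    FROM ad_spend AS a
    JOIN signups AS s ON s.campaign = a.campaign
    GROUP BY a.campaign, a.spend
    ORDER BY a.campaign
    """
).fetchall()

print(cost_per_signup)
```

With only a copy of the application database, the `ad_spend` side of this join would simply not exist.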

Of course, there are other pros and cons not covered here. When starting out, having a replicant database can be ok. However, you will likely need to go back at some point and rework your data infrastructure to deal with some of the issues referenced above.

The Open-Source Data Stack

Another common choice for data stacks is to rely entirely on open-source tools that are free or nearly free.

For example, a team might choose to rely on Postgres, Airflow, Airbyte, Datahub, and Metabase for their baseline data stack.

These are all open-source projects with varying cost structures, but in general most have a free option.

When a company is looking to reduce tooling costs and has a strong engineering team, open-source solutions can be a great option.

Some tools and solutions exist out there that can help companies reduce the engineering overhead by acting as an orchestrator of all of these tools.

Open source is a great option, and if your company can support the engineering overhead, it’s not a bad route. I have seen many companies prefer this route for varying reasons. Some people just like the notion of open source. Others don’t want to spend as much time working with account executives and dealing with their own internal bureaucracy to get a new technology (which can sometimes be sidestepped with open-source tools).

Overall, this is a valid solution.
