I have worked with several companies that initially relied on their OLTP databases to provide analytics. This generally works early on, but these systems aren't optimized to run complex queries.
Databases like MongoDB and Cassandra also add the complexity of not being SQL-friendly, and SQL is what most analysts and data practitioners are accustomed to using. This makes these systems an even poorer choice in the long run for performing analytics.
Why does it really matter what type of data storage system you use?
Data is data, and if you want to ask a question of your transactions database, you should be able to, right?
Yet we generally distinguish between OLTP and OLAP.
Also known as online transaction processing and online analytical processing, these are different use cases that benefit from contrasting designs and underlying software. This is why solutions like Teradata and Vertica were massive back in the day (and honestly still play a large role in many enterprises). Teradata was actually one of the first data warehouse systems to manage a TB of data, for Walmart.
But the question becomes, why?
Why do we need to duplicate our data and remodel it for analytics?
In this article, we will discuss some of the reasons why you will eventually need to develop an analytical system to answer your business questions.
Access Patterns
In order to understand the why, we first need to discuss access patterns. Access patterns refer to how data is accessed in a database system.
In OLTP systems, the access patterns are transactional, meaning that data is accessed and modified frequently in small amounts. Think about yourself as a user needing to update your profile or create a post on your favorite social media app.
You don’t need to interact with 100,000 rows (except maybe if you need to update Twitter feeds). No, you likely only need to update your own information.
You’re only interacting with a small subset of the data for your single transaction. But there are likely thousands of these transactions happening at once.
Usually these are short-lived transactions that involve inserting, updating, or deleting small amounts of data (often referred to as CRUD).
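To make that concrete, here is a minimal sketch of what one of those short-lived transactions might look like. The table and column names are hypothetical, not from any specific system:

```sql
-- A typical OLTP transaction: touch one row, commit, move on.
BEGIN;

-- Update a single user's profile (hypothetical users table).
UPDATE users
SET display_name = 'Cat Fan 42',
    bio          = 'I watch cat videos professionally.'
WHERE user_id = 1187;

COMMIT;
```

One row in, one row out, and thousands of these can be running at the same time.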
On the other hand, in OLAP systems, the access patterns are typically analytical, meaning that data is accessed and analyzed in large amounts. OLAP systems are designed to handle complex queries that involve aggregating and summarizing large volumes of data.
These queries are usually read-only.
And instead of having thousands of users querying your database all at once, you might have a dozen or so data analysts and data engineers querying the data.
Think about your baseline aggregate query where you want to find the average time your users spent watching videos on cats. That will probably hit a lot of your data storage system all at once, all to calculate how much time we spend watching cats.
Unlike OLTP queries, these don’t happen as often, but each one hits a large portion of your data.
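As a rough illustration, the cat-video question above might look something like this, assuming hypothetical video_views and videos tables:

```sql
-- An analytical query: read-only, but it scans a huge slice of the data.
SELECT AVG(vv.watch_time_seconds) AS avg_watch_time_seconds
FROM video_views AS vv
JOIN videos AS v
  ON v.video_id = vv.video_id
WHERE v.category = 'cats';
```

A single query like this may touch millions of rows, even though it only returns one number.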
Now that we understand the access patterns, it might be a little clearer why the systems are designed slightly differently.
Rows vs. Columns
"Not every data warehouse (OLAP) is necessarily a column store: traditional row-oriented databases and a few other architectures are also used." - Designing Data-Intensive Applications
DDIA is right; not every data warehouse (OLAP) needs to be a column store, and not every OLTP database is row-based. In fact, I once got into an argument during an interview because I made the point that data warehouses didn’t have to be columnar.
Why, then, do data professionals consider column-store data warehouses the “true” data warehouses?
Now, if you haven’t heard the term columnar database before, it’s sort of like the name suggests: the data is stored by column rather than by row.
There are several benefits that analytical queries get from having the data stored by column versus by row. But to understand that, it’s also helpful to understand how traditional row-based databases operate.
Your standard RDBMS stores data in rows, meaning all the information for said row is stored together. This makes it easy to retrieve because there is only one location that you need to pull the data from. This makes row-based databases excellent for transaction systems that often only interact with one row at a time (unless you’re running a massive delete or update).
But analytical queries generally don’t need to pull just one row, as discussed before. Instead, they might need to pull millions or billions of rows.
There is a key difference here, though. Instead of needing all the information on a user or transaction, many analytical queries are actually pretty straightforward (until you start building complex algorithms). The basic BI and reporting queries answer questions like:
What was the total revenue broken down by store, product, and color?
What is the average view time on videos by category?
Which type of customer is most likely to spend more than $100?
These queries are pretty straightforward and only require a few columns. In other words, pulling all the data is unnecessary and inefficient in terms of I/O. That’s a real problem if you store the data by row. But if you store your data by column, suddenly this is much less of an issue.
Now, only the required columns need to be read and processed, rather than returning the entire rows. This can greatly reduce the I/O and processing requirements for large analytical queries, which are common in data warehousing.
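For example, the revenue question from the list above might look roughly like this (hypothetical sales table); a columnar engine only has to read the four columns referenced, not every column in the table:

```sql
-- Only store, product, color, and revenue need to be read from storage.
SELECT store,
       product,
       color,
       SUM(revenue) AS total_revenue
FROM sales
GROUP BY store, product, color;
```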
Columnar databases also typically have advanced compression techniques specifically designed to reduce storage requirements and improve query performance. Additionally, columnar databases can leverage parallel processing and vectorization, enabling even faster query processing on modern hardware.
Now that we’ve handled the technical change, let’s talk about the logical modeling adjustment.
Data Modeling
In OLTP systems, data modeling focuses on representing the latest operational data, as well as being efficient in terms of managing the thousands of updates that could be happening every second. Thus, the data model is designed to support transactional processing and ensure data integrity.
OLTP systems typically use normalized data models, where the data is organized into highly structured tables, and the relationships between the tables are defined through primary and foreign keys. This allows for efficient data access and manipulation, which is critical for transaction processing.
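As a rough sketch (the table and column names are hypothetical), a normalized OLTP model might split an order across several tables linked by primary and foreign keys:

```sql
-- Normalized OLTP-style tables: each fact lives in exactly one place.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    email       VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers (customer_id),
    ordered_at  TIMESTAMP NOT NULL
);

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,
    order_id      INT NOT NULL REFERENCES orders (order_id),
    product_id    INT NOT NULL,
    quantity      INT NOT NULL,
    unit_price    DECIMAL(10, 2) NOT NULL
);
```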
In contrast, OLAP systems require a different type of data model to support complex analytical queries. Data modeling in OLAP systems typically involves denormalization, which means combining and summarizing data.
This can help simplify the analytical queries the end user needs to write and allows for faster data retrieval. You’ll often hear terms such as star or snowflake schema used when a bunch of data engineers and data architects are trying to design their data warehouse.
You may also hear references to OBT, or one big table, which, as the name suggests, is a very wide table that has fact and dimensional data all jammed into a single table. In my experience, I often see data teams use a combination: a dimensional model that eventually gets rolled up into an OBT for analysts to use.
Besides being more efficient, these models generally tend to be simpler to understand from an analyst’s perspective. In fact, I often see data teams take their star or snowflake schemas and create one big table to reduce the number of joins required to pull data (of course, sometimes analysts don’t find this helpful).
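To sketch the idea (all table and column names here are hypothetical), the OBT is often just the star schema pre-joined into a single wide table:

```sql
-- Denormalize a simple star schema into one big table (OBT).
CREATE TABLE obt_sales AS
SELECT f.order_id,
       f.ordered_at,
       f.quantity,
       f.revenue,
       c.customer_name,
       c.customer_segment,
       p.product_name,
       p.product_color,
       s.store_name,
       s.store_region
FROM fact_sales   AS f
JOIN dim_customer AS c ON c.customer_key = f.customer_key
JOIN dim_product  AS p ON p.product_key  = f.product_key
JOIN dim_store    AS s ON s.store_key    = f.store_key;
```

Analysts can then query obt_sales directly without writing the joins themselves.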
So, in theory, the hardware, the software, and the data model your data runs on for transactions versus analytics might be 100% different.
And there might be a valid reason.
I could go even more in depth on other concepts such as ROLAP, MOLAP, and HOLAP. But for now I will stop there, and hopefully it’s a little clearer why.
Why Do We Replicate Data?
Can you use a replica Postgres database to run your analytics even though it’s technically an OLTP system?
100%.
I have several current clients who have been running their analytics like this for a while; I even wrote about it. But eventually, your company may need to consider finding a better solution. This is generally driven by performance issues or the need to create a data layer that analysts can more easily work with.
And hopefully, you now have a better understanding of why we make the distinction between OLAP and OLTP. Many companies are trying to eliminate this difference by creating hybrid tables or building custom query languages that can query different types of data stores. But for now, once you have a large enough data set, you’ll need to duplicate your data.
Thanks for reading!
Video You May Have Missed - Data Engineering Vs Machine Learning Pipelines
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Trino: Open Source Infrastructure Upgrading at Lyft
As a data-driven company, many critical business decisions are made at Lyft based on insights from data. To support the various use cases of data scientists, analysts, and engineering teams, the Data Infra team at Lyft runs Trino as the main company-wide service to empower interactive query scenarios. Beyond that, Lyft encourages teams to leverage other major data features of Trino, such as dashboarding, reporting, and large volume Extract, Transform, Load (ETL) jobs. Our users and services send more than 250K queries with 50PB of processed data each day. Given the growing usage of Trino, the Data Infra team is dedicated to providing the latest Trino features and infrastructure upgrades to our internal customers so that we can more efficiently support their business scenarios.
Squeeze more out of your GPU for LLM inference—a tutorial on Accelerate & DeepSpeed
During my most recent project, I wanted to test how current large language models (LLMs) from Hugging Face perform on code generation tasks against the HumanEval dataset. I spun up a P3 instance from AWS with a V100 GPU and wrote a test framework using Hugging Face and PyTorch. The framework loaded each model checkpoint into the GPU, ran batch inference, then scored the model based on pass@K¹. It was smooth sailing until I hit the model EleutherAI/gpt-j-6b. Its checkpoint is 24.2GB — bigger than V100’s 16 GB memory — so of course it threw a CUDA Out Of Memory error right at my face.
End Of Day 81
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.