If you missed it, I had a live show with Joe Reis a few weeks ago where he referenced query-driven data modeling. Now, he was referencing the standard stages of data modeling:
Conceptual
Logical
Physical
but adding a 4th: Query-Driven Modeling.
According to Reis, many companies no longer perform the first two and often just go straight to developing queries to support a one-off request. Thus, we are likely not using any form of star schema, or even asking ourselves whether we are building a data mart or a data warehouse.
Instead, we are likely just building tables to support a single request. Or, once again, as Joe called it, Query-Driven Modeling.
But what is query-driven modeling, and does it have a place in the data world?
Query-Driven Modeling
Now, there isn’t an official definition for QDM, or JIT data models, although in the past we did have the term “schema-on-read,” which is somewhat similar.
But I think many of us know what it is.
It’s the building of tables or data sets in reaction to a single stakeholder’s request. There may be some form of requirements gathering, but only for the one stakeholder.
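For illustration, here is a minimal sketch of what query-driven modeling looks like in practice, using Python’s sqlite3 as a stand-in warehouse. Every table and column name here is hypothetical:

```python
import sqlite3

# In-memory database standing in for a warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "west", 100.0), (2, "west", 50.0), (3, "east", 75.0)],
)

# Query-driven modeling in a nutshell: skip the conceptual/logical steps and
# materialize a table shaped around one stakeholder's single request
# ("revenue by region for the Q3 review, please").
conn.execute(
    """
    CREATE TABLE sales_by_region_for_q3_review AS
    SELECT region, SUM(amount) AS total_revenue
    FROM raw_orders
    GROUP BY region
    """
)

rows = dict(
    conn.execute(
        "SELECT region, total_revenue FROM sales_by_region_for_q3_review"
    ).fetchall()
)
print(rows)  # {'east': 75.0, 'west': 150.0} (key order may vary)
```

Note the table name: it encodes the one request it serves, which is exactly the smell this article is about.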
Benefits of Query-Driven Modeling
Now, like any choice you make as an engineer, there are pros and cons.
Speed To Initial Insights - The clearest benefit of developing under a query-driven approach is time-to-insights (at least in the short term).
Self-Service - Tools like dbt have been a great door into data for many teams. They have also sped up the ability of analysts and other SQL-proficient folks to go from query to table. In turn, this has helped teams reach the “Holy Grail” of self-service considerably faster. When I first started in the data world, I recall reaching out to the EDW team with a query I had built and needed implemented. I had to wait 3-4 months to see it deployed…as a view. This wasn’t anything crazy, either. Many analytics teams likely can’t operate effectively in that environment and, in turn, will create a shadow IT team and data warehouse anyway.
Stakeholders Will Be Happy In The Short Run - Many data teams get stuck in loops of constantly developing JIT because stakeholders face pressure of their own. Managers and directors need numbers to give to their VPs, VPs need numbers to give to the C-suite, and the C-suite needs numbers to give to the board. That constant downward pressure lands on newly minted analysts and overworked data engineers who want to (or are forced to) deliver what their managers ask.
Thus, using query-driven modeling is the answer. You can query from raw data sets, run a few gut checks on the data, and boom.
Answers.
Everyone is happy.
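Those “gut checks” are often just a handful of quick sanity assertions before the numbers go out the door. A minimal sketch, with entirely hypothetical data:

```python
# Hypothetical raw extract; a few "gut checks" before shipping the numbers.
rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 80.0},
    {"order_id": 3, "amount": None},
]

row_count = len(rows)
null_amounts = sum(r["amount"] is None for r in rows)  # count missing values
total = sum(r["amount"] or 0 for r in rows)            # treat missing as zero

# The kind of lightweight checks a JIT workflow relies on:
assert row_count > 0, "extract came back empty"
assert null_amounts < row_count, "every amount is missing"

print(row_count, null_amounts, total)  # 3 1 200.0
```

This is fast, and it does catch obvious breakage, but notice that none of it encodes a shared definition anyone else can reuse.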
Cons Of Query-Driven Modeling
There is always a cost for teams deciding to use a JIT approach to their data models. And to be clear, whether this is a conscious decision or not, it is made. There are clear trade-offs.
Decreased Agility When Changes Are Required - Although development time is faster, the ability to pivot and change what a system is doing will worsen as more models, dependencies, and tech debt are built up. Each new weak link in the complex system that is developed leads to a possible failure point that no one might be aware of.
Lack of Coherent Metrics - When data teams take a JIT approach, there is a good chance that the same metrics will be developed by several teams using slightly different methodologies, leading to the classic issue where teams’ numbers don’t match. Many companies I have seen, including Facebook, in turn create some form of metrics layer or portal that helps define key metrics and call out where they are used.
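To make the mismatched-numbers problem concrete, here is a tiny sketch (hypothetical events and definitions) in which two teams both ship an “active users” metric with slightly different filters:

```python
# Hypothetical event log: (user_id, event_type)
events = [
    (1, "login"), (1, "purchase"),
    (2, "login"),
    (3, "page_view"),
]

# Team A defines "active" as any recorded event.
team_a_active = len({user for user, _ in events})

# Team B defines "active" as having logged in.
team_b_active = len({user for user, event in events if event == "login"})

# Same metric name, two different numbers headed to two different decks.
print(team_a_active, team_b_active)  # 3 2
```

Both definitions are defensible; the problem is that nothing forces the two teams to reconcile them, which is exactly what a metrics layer is meant to do.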
Spaghetti Pipelines - When JIT is heavily relied on, the data pipeline systems that are created can quickly become spaghetti pipelines. You’ll find yourself 18 DAGs deep only to figure out that the customer_category field you thought came from Source A was actually populated by Destination B and eventually made its way back to Source A (I have seen it).
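The circular customer_category dependency described above can be caught mechanically. Below is a minimal sketch, with entirely hypothetical pipeline names, of a depth-first search that finds such a cycle in a dependency graph:

```python
# Edges read "node depends on these upstreams"; names are hypothetical.
deps = {
    "source_a": ["destination_b"],   # the surprise: A is fed back from B
    "destination_b": ["source_a"],
    "reporting_mart": ["source_a"],
}

def find_cycle(graph):
    """Depth-first search; return one cycle as a list of nodes, or None."""
    done = set()  # nodes fully explored with no cycle found

    def dfs(node, path):
        for upstream in graph.get(node, []):
            if upstream in path:  # back-edge to the current path: cycle
                return path[path.index(upstream):] + [upstream]
            if upstream not in done:
                cycle = dfs(upstream, path + [upstream])
                if cycle:
                    return cycle
        done.add(node)
        return None

    for node in graph:
        cycle = dfs(node, [node])
        if cycle:
            return cycle
    return None

print(find_cycle(deps))  # ['source_a', 'destination_b', 'source_a']
```

A check like this can run in CI against whatever dependency metadata your orchestrator exposes, so the loop is caught before anyone is 18 DAGs deep.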
Less Robust Data Sets - Besides the risk of self-dependency, another issue is that even what might be perceived as a small change to one pipeline can have large consequences. Since no table is defined as core, per se, and governance is minimal, it’ll be hard to fully understand the weight of a change. Now, we do create solutions for this, such as data lineage; but to some degree, if you’re overly relying on data lineage to track down how important a data set is, you likely already have an issue.
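As a rough sketch of the lineage idea mentioned above (all table names here are hypothetical), the weight of a change can be estimated by walking a table’s downstream consumers:

```python
from collections import deque

# Edges read "table -> tables that consume it directly"; names are hypothetical.
downstream = {
    "dim_customers": ["fct_orders", "mart_churn"],
    "fct_orders": ["mart_revenue", "exec_dashboard"],
    "mart_churn": [],
    "mart_revenue": [],
    "exec_dashboard": [],
}

def blast_radius(table):
    """Breadth-first walk: every data set affected if `table` changes."""
    seen, queue = set(), deque([table])
    while queue:
        for dep in downstream.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

print(blast_radius("dim_customers"))
# ['exec_dashboard', 'fct_orders', 'mart_churn', 'mart_revenue']
```

The point of the article stands, though: if you only learn a table is load-bearing by running this after the fact, the governance gap already exists.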
Cost - When I first started in the data world, my stakeholders told me they wanted real-time data for their dashboards. So I clicked “live data” on my Tableau dashboards and published them.
Well, as you can likely imagine if you’ve been in the data world for even two or three years, it wasn’t long before two consultants came down and called this out as causing a decently large spike on the servers. The truth is, tools like Tableau and dbt solve problems, but they can also cause new ones. They are so easy to use that they can increase not only the speed of insights but also the speed of bad decisions and compute usage, leading to greater costs.
With all those cons, does Query-Driven Modeling have a place in the data world?
Does Query-Driven Modeling Have Any Place In The Data World?
I do think there are places for a more JIT approach to creating data sets. This is where the role of an Analytics Engineer really does make sense for some teams.
When companies have larger data teams and clear initiatives per department regarding how they want to use data, using a JIT approach combined with a more traditional core data model makes sense.
This was what we essentially had at Facebook.
The team I worked on created what I considered “Data As Infra.” That is to say the tables we built were relied upon by many other teams at Facebook. Other data engineers were building their insights and analytics based on what we built.
It got to the point where we wouldn’t be able to manually go through and understand the impact of making small changes. This meant we had to treat our data more like infrastructure, not just like a table you can change because of a single stakeholder’s request.
However, if you went one team away from us, you’d find a combination of very technical analysts and business-focused data engineers needing to create more one-off data sets supporting clear business asks. (Now, this occasionally made it difficult to directly point to our impact, since it was so all-encompassing.) But treating core tables more as infrastructure, with policies and governance in place to ensure a small change doesn’t break all of the dependencies, makes sense. Obviously, this is not a perfect model, and I have seen plenty of small organizations run fine with JIT modeling, and others that use OBT, but none that really discuss conceptual modeling.
But generally speaking, you’ll eventually need to pay for the tech debt, whether that's with a code re-write or continuing down the path and finding you reach a point where this meme is reality.
If you are in a place where you need help assessing your data infra or data strategy, then set up a consultation today!
This happens all the time in many modern data teams, often because analysts and data scientists are forced to develop their own data pipelines. They generally focus on answering a specific set of requests, not developing robust data sets that can support multiple use cases.
Design Is An Afterthought
Building data sets and workflows exactly when required with little to no requirements gathering has its benefits the same way extreme prototyping has its benefits. But for every benefit, there are tradeoffs.
Data sets become part of complex dependency trees, the business requests constant changes, and, in turn, the queries steadily deteriorate to the point where they are likely carrying far more usage weight than they were initially tested for.
Take, for example, a beam being put under new stresses in a building.
Eventually, it’ll collapse and likely take out several other beams with it.
Is query-driven modeling a type of data modeling? Sure.
But you better know the risks.
Thanks for reading.
Different Types Of Data Engineering Roles - You Don't Have To Have Data Engineer In Your Title
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Attribute-Based Access Control at Uber
Uber relies on microservices to support its operations. Microservices necessitate a flexible authorization policy model to satisfy their unique authorization requirements. Attribute-based access control (ABAC) offers a dynamic, context-aware, and risk-intelligent approach to access control. By leveraging ABAC, access control policies can be crafted based on specific attributes obtained from diverse information systems. This enables Uber to establish a sophisticated access policy management system that facilitates access in a manner that promotes least privilege, enhances efficiency and effectiveness, and, most importantly, maintains consistency across dissimilar information systems when managing access.
Taking Charge of Tables: Introducing OpenHouse for Big Data Management
At LinkedIn, we build and operate an open source data lakehouse deployment to power Analytics and Machine Learning workloads. Leveraging data to drive decisions allows us to serve our members with better job insights, and connect the world’s professionals with each other.
Open source data lakehouse deployments are built on the foundations of compute engines (like Apache Spark, Trino, Apache Flink), distributed storage (HDFS, cloud blob stores), and metadata catalogs / table formats (like Apache Iceberg, Delta, Hudi, Apache Hive Metastore). End-users create relational entities in the form of Tables over structured or semi-structured data using compute engines, with the metadata for a Table stored in a catalog, and data stored in distributed storage.
End Of Day 90
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
Isn't this pattern well-served by Data Virtualization? Basically, the virtualized views exposed by the Data Virtualization layer are the JIT tables that Stakeholders demand. Thoughts?
Could try the Kimball lifecycle approach and try to model a business process? Then you have data that can be used for more than one report. And who needs conceptual, logical, physical model? With dimensional modeling, they are pretty much the same thing. I don’t need E/R either. Relational modeling is good enough for me.