Here is something school probably didn't teach you about SQL
Or at the very least you likely forgot.
When you write a query, hit submit, and then run the query or that little triangle in DBBeaver…
…what exactly happens?
Sure, you likely understand that data is pulled from multiple tables, data is filtered, and aggregations occur.
But behind the scenes, what is going on?
How does SQL go from English into the lingua franca of data?
In this article we will answer that question.
Query Execution Components
Let’s go through how your query gets parsed, and concepts like an execution plan.
My goal is to try to make this more exciting than my professor did in my database course.
Also, I am going to use the term SQL engine instead of RDBMS just because you have solutions like Trino or other similar solutions that are not a RDBMS.
Parser and Validation
The very first step in the process is lexical analysis(another term for this is tokenization). The goal of lexical analysis is to break down the SQL query into a series of tokens. Tokens are the smallest units of meaning in the SQL query, such as keywords, identifiers, operators, literals, and punctuation marks. This means your SELECT, FROM and WHERE for keywords and tables, and columns for identifiers, and <, = and <= for operators.
From there, you’ll likely have two more steps, depending on the SQL engine you’re using.
Syntax Check - This ensures that the SQL query follows the correct grammar and structure as defined by the SQL language. It identifies syntax errors like missing commas, misplaced keywords, or unmatched parentheses and that keywords are in the proper order.
Semantics Check - The SQL engine checks the query for semantic correctness. This involves checking whether the tables, columns, and other database objects referenced in the query exist and whether the operations are valid (e.g., ensuring data types are compatible).
Now as I said, this is dependent on what SQL engine you’re using. For example, here is an expert from Understanding MySQL Internals,
MySQL’s parser, like many others, consists of two parts: the lexical scanner and the grammar rule module. The lexical scanner breaks the entire query into tokens (elements that are indivisible, such as column names), while the grammar rule module finds a combination of SQL grammar rules that produce this sequence, and executes the code associated with those rules. In the end, a parse tree is produced, which can now be used by the optimizer. - Understanding MySQL Internals.
The end result will be something like the parse tree below.
Now on to the next step!
Quick Pause!
If you’d like to help support this newsletter and get access to all my newsletters and future paid articles.
With that you’ll also get access to:
All past and future newsletters
Private AMA with me for paying subscribers only
Future webinars discussing data leadership, strategy and infrastructure topics
Translator/Algebrizer/Query Transformer
Keep reading with a 7-day free trial
Subscribe to SeattleDataGuy’s Newsletter to keep reading this post and get 7 days of free access to the full post archives.