What Is dbt and Why Are Companies Using It?
And What Do Data Engineers Do On A Daily Basis - Community Update #24
It’s about to be a decade since the Harvard Business Review article touting data science as the sexiest job of the 21st century. So I believe we need to get past saying “data is the new oil” and start building maintainable data stacks.
This is why my recent articles have focused on the modern data stack and the role these tools can play in building maintainable systems, whether it be Astronomer, Fivetran, Snowflake, or Starburst Data. All of these tools are trying to shift much of the heavy lifting that data engineers have performed for the past few years onto the tools themselves.
One company that saw an opportunity to help simplify data workflows was Fishtown Analytics. Perhaps you have heard of their tool in passing.
Maybe you recognize the abbreviation.
dbt (data build tool) is used to help transform data and build data pipelines while also making the process faster and more accessible. But what is dbt? What can it do, and why should your company use it? Let’s discuss these questions and more.
What Is dbt?
Built by Fishtown Analytics (now dbt Labs), the data build tool, or dbt, is a command-line tool that allows data analysts to execute the “transform” step in the Extract-Load-Transform pipeline. They do this by writing dbt code in their preferred text editor and then running dbt from the command line. dbt then compiles the code into SQL and executes it against the company’s data warehouse.
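Here is an example: a minimal sketch of a dbt model, assuming a hypothetical models/customer_orders.sql file in the project:

-- models/customer_orders.sql
-- A dbt model is a single SELECT statement; dbt decides how to
-- materialize it (e.g. as a table or view) in the warehouse.
select
    customer_id,
    count(order_id) as order_count,
    sum(amount) as lifetime_value
from raw_data.orders
group by customer_id

Running dbt run from the command line compiles this model to SQL and builds the resulting table or view in the warehouse.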
Dubbed “the analytics engineering tool,” dbt is also open source and has become a regular part of many companies’ modern data stacks. A solid support community has also formed around it.
dbt's Recent Funding
dbt Labs, the company behind dbt, raised $150 million in a round of Series C funding led by Sequoia Capital, Andreessen Horowitz, and Altimeter that valued the company at $1.5 billion. The round raised capital through the sale of preferred shares, which give their holders the right to exchange them for common stock in the company in the future. Before this recent funding, dbt Labs had raised $42 million across two separate rounds. With the current cash injection, the company said it will double its efforts to develop its core open-source platform, dbt.
This sentiment was echoed by dbt Labs founder and CEO Tristan Handy: “Right now, our focus is on improving our core offering and supporting its exponential growth as the foundation of one of the highest-growth areas in all of the enterprise software. We also have our eye on some experimental new areas of product development, but nothing we’re ready to share yet.”
What Are The Key Technical Steps In Using dbt?
Users of dbt create projects made up of a .yml project file and one or more .sql files (models), each of which contains a SELECT statement. When the project runs, dbt performs the necessary operations against a data warehouse. The standard version of dbt supports BigQuery, Postgres, Redshift, and Snowflake warehouses, though many users have created and shared “adapters” for other warehouses. Finally, dbt outputs the required data in the required format, ready for analysis.
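As a rough sketch, a minimal project (all names here are hypothetical) might be laid out like this:

my_project/
├── dbt_project.yml          # the .yml project file
└── models/
    ├── stg_orders.sql       # each model is a single SELECT statement
    └── orders_summary.sql

with a dbt_project.yml along these lines:

# dbt_project.yml
name: my_project
version: '1.0.0'
config-version: 2
profile: my_project    # which warehouse connection to use

dbt reads the project file, runs each model against the configured warehouse, and leaves the results ready for analysis.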
Key Features Of dbt
dbt has several important built-in features that make it easier to use and more productive. (A thriving community of users, including a Slack workspace of some 15,000 members, also shares tips and develops new ways to get more out of dbt.)
Visual Representations Of Tasks Or DAGs
dbt can automatically build graphs, known as directed acyclic graphs (DAGs), that show the dependencies between the different models in a project. One benefit is that it’s simpler for data analysts to keep track of dependencies, spot errors, and figure out what’s behind any problems. Another is that it’s easier to demonstrate and explain a project to supervisors or staff in other departments who don’t have specialist data skills. The graph is a much more intuitive way to explain and understand what analysts are actually doing with data.
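These dependencies come from how models reference one another. A minimal sketch, assuming a hypothetical upstream model named stg_orders:

-- models/orders_summary.sql
-- ref() tells dbt that this model depends on stg_orders,
-- which adds an edge to the project's DAG
select
    customer_id,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by customer_id

dbt uses these ref() calls to build and order the graph, and dbt docs generate renders it as a browsable lineage diagram.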
Quality Tests
dbt has a built-in tool for testing data quality, picking up problems that could make a particular query impractical or compromise its output. While many of the tests cover straightforward concepts, such as looking for unwanted null or duplicate values, the built-in testing works particularly smoothly: running the test tool automatically generates a query for each test the project defines in its .yml files.
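As a sketch of how this looks, tests are declared next to models in a .yml file (the model name below is hypothetical):

# models/schema.yml
version: 2
models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique      # no duplicate IDs
          - not_null    # no missing IDs

Running dbt test compiles each declaration into its own query and reports any rows that violate it.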
Version Control
While we’ve talked a lot about an individual data analyst running dbt, in practice many businesses have analysts collaborating on projects, often working at different stages simultaneously. The version control feature makes it much simpler to keep track of this workflow while breaking it down into small, logical steps. Mistakes or conflicts don’t have to be fatal, as it’s simple to revert to a previous version without undoing all the work that did go as planned.
Sandboxing
Formally known as environment management, this makes it simple to work on data completely separate from any other user. This gives users the reassurance that they can’t unintentionally alter the original raw data or the output that anyone else is working on. It also removes the confusion that can arise when analysts work on other people’s projects, for example reviewing their code.
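In practice, this separation is configured through targets in dbt’s profiles.yml. A minimal sketch, with hypothetical names and a Postgres warehouse:

# ~/.dbt/profiles.yml
my_project:
  target: dev                 # default environment
  outputs:
    dev:
      type: postgres
      host: localhost
      user: alice
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: dbt_alice       # private sandbox schema
    prod:
      type: postgres
      host: warehouse.internal
      user: dbt_prod
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: analytics       # shared production schema

By default, dbt run builds everything into the analyst’s private dev schema; deploying to production is an explicit dbt run --target prod.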
Should You Use dbt?
If you or your staff work on data analysis, dbt can be a hugely efficient way to better use your data without the wasted time and technical confusion of manually performing data transformation. It means a business can hire people based primarily on their analysis expertise and take full advantage of that expertise.
Most data analysts should be able to get to grips with dbt relatively quickly as long as they are comfortable with SQL and SELECT statements. It’s also a big advantage to know Git, particularly for managing workflows. Remember that dbt is simply a tool and doesn’t replace the need to have a clear idea of why you are analyzing data and what you want to find out. It simply removes some of the grunt work that’s necessary before you can exercise your skills. Ultimately, dbt really is about transformation: turning the data pipeline from an obstacle to an opportunity, freeing up your analysts to concentrate on realizing the maximum potential of your data.
Thanks To The SDG Community
I started writing this weekly update more seriously about 10-11 weeks ago. Since then, I have gained hundreds of new subscribers as well as 22 supporters! I even got 10 special thanks on YouTube!
And all I can say is, Thank You!
You guys are keeping me motivated.
Also, if you’re interested in reading some of our past issues, such as Greylock VC and 5 Data Analytics Companies It Invests or The Future Of Data Science, Data Engineering, then consider subscribing and supporting the community!
Video Of The Week - Day In The Life Of A Data Engineer - What Do Data Engineers Do?
Data engineering is becoming a more popular career choice as the need to manage, integrate and process data grows.
But what exactly do data engineers do all day?
Do they write code?
Do they create data warehouses?
Let's talk about it.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Data pipeline asset management with Dataflow
The problem of managing scheduled workflows and their assets is as old as the use of the cron daemon in early Unix operating systems. The design of a cron job is simple: you take some system command, you pick the schedule to run it on, and you are done. Example:
0 0 * * MON /home/alice/backup.sh
In the above example, the system would wake up every Monday morning and execute the backup.sh script. Simple, right? But what if the script does not exist at the given path, or what if it existed initially but then Alice let Bob access her home directory and he accidentally deleted it? Or what if Alice wanted to add new backup functionality and accidentally broke the existing code while updating it?
The answers to these questions are something we would like to address in this article, where we propose a clean solution to this problem.
How To Improve Your Data Analytics Strategy For 2022
2022 is around the corner, and it is time to start looking at improving your data strategy.
Our team has seen several trends in 2021 in terms of methods that can help improve your data analytics strategy.
Whether it be optimizing your Snowflake table structures to save $20,000 a year or optimizing your pipelines to reduce dashboard load times by 30-50x, our team has had the opportunity to improve the data analytics strategy and infrastructure of companies of all sizes.
Data analytics is more than a buzzword.
Data analytics is driving companies.
Start-ups, billion-dollar Fortune 500 companies, and single-owner businesses are using data to drive their business.
5 Best Practices for managing Azure DevOps CI/CD Pipelines with Matillion ETL
Azure DevOps is a highly flexible software development and deployment toolchain. It integrates closely with many other related Azure services, and its automation features are customizable to an almost limitless degree.
Matillion ETL is a cloud-native data integration platform, which helps transform businesses by providing fast and scalable insights. Matillion ETL’s DevOps artifacts are source code and configuration files, and they are accessible through a sophisticated REST API.
Great flexibility in both platforms means it’s possible to integrate Matillion ETL with Azure DevOps in many different ways. It can be difficult to decide how best to proceed! To help with that, we present five best practices that will help you get the most out of Azure DevOps in a Matillion ETL software development context.
End Of Day 24
Thanks for checking out our community. We put out four newsletters a week discussing data, tech, and start-ups.
If you want to learn more, then sign up today. You can subscribe for free to keep receiving these newsletters.