Mistakes I Have Seen When Data Teams Deploy Airflow
What You Should Consider When Working With Airflow
Airflow remains a popular choice when it comes to open-source orchestration tools.
When I surveyed people about a year ago, it was the most popular open-source solution, and to this day my video “Should You Use Airflow” drives a lot of prospect conversations.
Now, I do want to say that there are plenty of organizations using Azure Data Factory and Informatica, and there are plenty of competitors knocking on Airflow's door.
But for now, Airflow is like the PHP of the data world; people can talk poorly about it, but it continues to be heavily relied upon.
Now, as I said, Airflow is often the reason I get brought into projects, which means I have seen many different ways teams decide to deploy it.
Some scaled, others didn’t.
Thus, I wanted to take a moment and discuss some ways I have seen Airflow deployed in the past and the challenges people faced as they deployed their code.
DAG Folder, Scheduler and Web Server in One Repo
Airflow has always been deceptively easy to stand up. You can put your DAG folder, scheduler, and web server configuration in a single repo and have everything running quickly.
But if you don’t have the DAGs in a separate repo from your Airflow deployment, then the moment you need to push a quick change to a DAG, guess what happens: you’ll have to deploy your entire project.
In other words, you’ll likely be taking down your webserver and scheduler for a few seconds or minutes just to push some code changes. During that time, the jobs that are currently running may or may not continue. So, if you’ve got a job that’s been running for 15 hours…
It’s starting over.
This is far from ideal behavior.
Instead, your DAGs folder should live in a separate repo (or perhaps several repos) that push to a centralized location like an S3 bucket, which is then synced to a file system attached to your Airflow instance(s).
I have seen this approach both at some clients and in Shopify’s and Scribd’s articles. I’ll go over some other points that Shopify covered later; for now, I’ll link to Scribd’s article. That article focuses on breaking up a DAG monorepo, but it will also give you an understanding of how they structured their DAGs folder.
Overall, it’s hard to replicate ahead of time the problems you will face when deploying Airflow, but learning more about how Airflow works before you hit them isn’t.
Not Using Features Airflow Provides
There are tons of great features that Airflow provides that sometimes get missed. Perhaps it’s because teams are pushed to deliver, or because tutorials don’t cover them, but I have seen several cases where what could be considered basic-to-intermediate Airflow functionality gets glossed over and rewritten from scratch.
In particular, Hooks and Variables are two that come to mind.
Both of these features are self-explanatory. But if you’re just learning Airflow in the middle of having to deploy and hit deadlines, I think they can get missed.
Now, instead of talking about a mistake, I wanted to talk about a project that implemented Hooks and Variables. There isn’t much to say other than it made my development process smooth.
Instead of constantly hunting for values or wondering if a connection would work, I flew through development because I could easily point to the correct abstracted data sources and variables.
Just in case you haven’t been exposed to it, here is a quick background on hooks and variables.
Hooks are interfaces to external platforms and databases like Hive, BigQuery, and Snowflake.
They are used to abstract the methods for connecting to these external systems. Instead of repeating the same connection string over and over again, you can pull in a Hook, like in the example below (Airflow 2.x import paths shown for clarity):

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.http.hooks.http import HttpHook

s3_hook = S3Hook(aws_conn_id="aws_conn")
http_hook = HttpHook(method="GET", http_conn_id=hubspot_conn_id)  # conn id defined elsewhere
```
Why are Hooks useful:
Abstraction: Instead of having custom code in each DAG for each type of connection, you can use Hooks to avoid writing a different connection string for every database/source.
Centralized Management: Connection information is stored centrally in Airflow's metadata database. This means that credentials, host information, and other connection settings are managed in one place, which helps in security and maintenance.
Extendability: Airflow has a lot of pre-built Hooks for popular systems, but if you have a custom or niche system, you can create a custom hook.
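To make the extendability point concrete, here is a minimal sketch of a custom Hook. The class name, connection ID, and URL are hypothetical, and the `try/except` fallback is only there so the snippet runs even without an Airflow install:

```python
# A minimal custom Hook sketch. HubSpotHook, the connection id, and the
# URL are illustrative examples, not a real Airflow provider.
try:
    from airflow.hooks.base import BaseHook  # Airflow 2.x import path
except ImportError:
    class BaseHook:  # stand-in so the sketch runs without Airflow installed
        pass

class HubSpotHook(BaseHook):
    """Centralizes connection logic for a niche API in one place."""

    def __init__(self, conn_id: str = "hubspot_default"):
        self.conn_id = conn_id

    def base_url(self) -> str:
        # In a real Hook you would call self.get_connection(self.conn_id)
        # and build this from the stored host and credentials.
        return "https://api.hubapi.com"

hook = HubSpotHook()
```

Every DAG that talks to this system now imports one class instead of re-declaring credentials and URLs.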
Variables are a simple key-value store within Airflow. In addition, they can be encrypted, which helps when data needs to be secure but isn’t being managed by a connection ID. Below are a few reasons teams reach for Variables.
Why are Variables useful:
Dynamic Configuration: Instead of hardcoding specific values in your DAGs, you can use Variables, making updating configurations easier without altering the DAG code.
Security: Sensitive data can be stored as an encrypted variable within Airflow. This means secrets or passwords can be kept out of the DAG code and stored securely.
Reusability: If multiple DAGs require the same piece of information, instead of replicating the data, you can store it as a Variable and reference it in the needed DAGs.
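As a sketch of the usage pattern: the keys and values below are hypothetical, and the `except ImportError` branch is a tiny stand-in that mimics `Variable.get` so the snippet runs without an Airflow install:

```python
try:
    from airflow.models import Variable
except ImportError:
    import json

    class Variable:  # minimal stand-in mimicking Airflow's interface
        _store = {"environment": "dev", "etl_config": '{"batch_size": 500}'}

        @classmethod
        def get(cls, key, default_var=None, deserialize_json=False):
            val = cls._store.get(key, default_var)
            if deserialize_json and isinstance(val, str):
                val = json.loads(val)
            return val

# Plain string value, with a fallback if the Variable isn't set.
env = Variable.get("environment", default_var="dev")

# JSON Variables deserialize straight into Python structures.
config = Variable.get("etl_config", deserialize_json=True,
                      default_var={"batch_size": 500})
```

Updating the value in the Airflow UI then changes behavior everywhere, with no DAG code edits.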
With Airflow constantly rolling out new versions, there are probably a few features that I have missed that would, once again, make my job easier. So do keep up to date with the new Airflow roll-outs!
Not Preparing For Scale
If you’re rushed to deploy Airflow, you might not realize, or you might not take the time to think through, how workers and schedulers will scale out as more jobs start to be deployed.
When you first start deploying Airflow DAGs, you won’t notice this issue. All your DAGs will run (unless you have a few long-running ones) without a problem.
But then you start having 20, 30, 100 DAGs, and you’ll notice tasks sitting in the light-green “scheduled” state for a while before they run. Now, one issue might be that you need to change some configurations in your airflow.cfg (if you can’t tell, this file is your new friend when you use Airflow), but another might be that you’re using the wrong executor.
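For reference, these are the airflow.cfg knobs (Airflow 2.x names) I tend to look at first. The values below are illustrative, not recommendations; tune them for your workload:

```ini
[core]
executor = CeleryExecutor
parallelism = 64                 # max tasks running across the whole deployment
max_active_tasks_per_dag = 16    # concurrent tasks allowed within one DAG
max_active_runs_per_dag = 4      # concurrent runs of the same DAG

[scheduler]
parsing_processes = 4            # processes used to parse DAG files
```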
So, this is a great time to review some of the executors that exist:
SequentialExecutor - runs tasks one at a time, ideal for debugging.
LocalExecutor - permits parallel task execution on a single machine.
CeleryExecutor - distributes tasks across multiple nodes using the Celery framework.
KubernetesExecutor - dynamically allocates tasks as isolated Kubernetes pods, ideal for cloud-native setups.
DebugExecutor - tailored for in-depth task debugging using the Python debugger.
But the truth is, it’s not even that simple. As Megan Parker from Shopify put it in the article Lessons Learned From Running Apache Airflow at Scale:
There’s a lot of possible points of resource contention within Airflow, and it’s really easy to end up chasing bottlenecks through a series of experimental configuration changes. Some of these resource conflicts can be handled within Airflow, while others may require some infrastructure changes.
Following this section, she discusses Pools, Priority Weights, Celery Queues, and Isolated Workers, and perhaps the solutions that Shopify came up with aren’t even the best fit. I think my favorite is the line:
“through a series of experimental configuration changes.”
This isn’t a programming issue.
You’ll need to consider multiple factors to set up Airflow for scale. I could pretend to know every solution, but what I have used in the past might not work for everyone, and your fix might require a little testing before you can confirm you’re able to scale.
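To make the knobs Shopify mentions (Pools, priority weights, Celery queues) concrete, here is how they appear on a task definition. The pool, queue, and command names are hypothetical, and the `except ImportError` stub only keeps the snippet runnable without an Airflow install:

```python
try:
    from airflow.operators.bash import BashOperator  # Airflow 2.x path
except ImportError:
    class BashOperator:  # stand-in so the sketch runs without Airflow installed
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)

heavy_task = BashOperator(
    task_id="rebuild_warehouse",
    bash_command="dbt run",
    pool="heavy_jobs",         # Pools cap concurrent slots for a contended resource
    priority_weight=10,        # scheduled ahead of lower-weight tasks
    queue="isolated_workers",  # CeleryExecutor routes this to dedicated workers
)
```

Which combination of these actually relieves your bottleneck is exactly the kind of thing that takes experimentation.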
Airflow Is Easy Until It’s Not
Airflow is nearly ten years old and has continued to be adopted at companies ranging from start-ups to enterprises (I say that because some people don’t think enterprises are using it). Anecdotally speaking, I have worked with several this year alone.
That being said, I’d put money down that Azure Data Factory is probably doing more in terms of total jobs.
Regardless, Airflow is challenging to manage in production. Perhaps that’s why there are so many alternatives to Airflow popping up.
Don’t let the basic DAG tutorials fool you. Building DAGs is easy. Managing and deploying Airflow is hard.
But, I guess that really is how it is with all things.
Your first website is easy to build because it's local.
Building an ML model is “easy-ish.”
But putting things into operation is always hard.
Thanks for reading!
If you need help with your Airflow setup or data strategy, then feel free to set up some time with me and my team!
Video Of The Week: Do Data Engineers Need To Learn Docker?
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Selective Column Reduction for DataLake Storage Cost Efficiency At Uber
As Uber continues to grow, the sheer volume of data we handle and the associated demands for data access have multiplied. This rapid expansion in data size has given rise to escalating costs in terms of storage and compute resources. Consequently, we have encountered various challenges such as increased hardware requirements, heightened resource consumption, and potential performance issues like out-of-memory errors and prolonged garbage collection pauses.
End Of Day 96
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.