Data pipelines are a key component in any company’s data infrastructure.
One framework that many companies use to manage their data extracts and transforms is Airflow, whether that means relying entirely on Airflow and its various operators or using Airflow to orchestrate other components such as Airbyte and dbt.
For those unaware, Airflow was developed back in 2014 at Airbnb to help manage its ever-growing need for complex data pipelines. It rapidly gained popularity outside of Airbnb because it was open-source and met many of the needs data engineers had.
Now, nearly a decade later, many of us have started to see the cracks in its armor. We have seen its "Airflow-isms," as Sarah Krasnik puts it.
In turn, this has led to many new frameworks in the Python data pipeline space, such as Mage, Prefect, and Dagster.
Many of us still rely on Airflow. But Airflow has its fair share of quirks and limitations, many of which don’t become obvious until a team attempts to productionize and manage Airflow in a far more demanding data culture.
So in this update, I wanted to talk about why data engineers love/hate Airflow. Of course, I didn’t just want to state my opinion. Instead, I interviewed several other data engineers to see what they like about Airflow and where they feel it is lacking.
Where We Started
When I started interviewing other data engineers about which data pipeline solutions they started with, I assumed there would be some level of consistency in their answers. Perhaps most people would start with SSIS or cron and bash scripts.
However, everyone started in different places. Some started with Airflow, others with SSIS, and others with data integration systems that I had never heard of. But here are a few quotes:
Joseph Machado - Senior Data Engineer @ Linkedin - “I started running python scripts locally”
Sarah Krasnik - Founder of Versionable - “I’ve actually only ever used Airflow”
Matthew Weingarten - Senior Data Engineer @ Disney Streaming - “At my first company we used TIBCO as our enterprise data pipeline tool”
Mehdi (mehdio) Ouazza - mehdio DataTV - “Airflow was the first tool I used, running an on-prem system”
It may feel like there are a lot of data pipeline solutions today, but there have always been a thousand different ways to transport data from point A to point B, whether through custom-developed infrastructure or no-code, drag-and-drop vendor options. Picking the best solution has always been a challenge.
Why We Like Airflow
Why has Airflow gained such a stronghold in the world of data engineering? Take this quote from a 2015 article written by the creator of Airflow:
As a result of using Airflow, the productivity and enthusiasm of people working with data has been multiplied at Airbnb. - Maxime Beauchemin
Airflow proved to be a solution that could boost productivity at a time when data engineering was constantly bogged down with one-off requests (wait, this hasn’t changed) and constant migrations.
Of course, that was Airbnb; they built their own solution. So why did so many other data engineers pick it up?
Easy To Start
A great thing about Airflow is that building your first toy DAG is very easy.
First, all you need to do is write a DAG file containing a few parameterized operators.
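For example, a toy DAG might look something like this. This is just a minimal sketch, assuming Apache Airflow 2.x is installed; the DAG ID, task names, and commands are all illustrative.

```python
# A minimal toy DAG: two tasks, run daily, extract before load.
# Sketch only -- assumes `apache-airflow` 2.x is installed.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for your actual extract logic.
    print("extracting data...")


with DAG(
    dag_id="toy_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # cron presets or raw cron strings both work
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo loading")

    # Dependency: extract runs before load.
    extract_task >> load_task
```

Drop a file like this into your `dags/` folder and Airflow will pick it up automatically.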
After that you can run:
airflow standalone
Suddenly you’re running Airflow, locally or maybe on an EC2 instance.
From there, you’re kind of done. Of course, we haven’t considered scaling or the fact that your logs will blow up your storage, but for the first few months this will work alright.
Scheduling
Airflow provides an easy-to-understand scheduler. Using cron-based scheduling, a developer can easily set their DAGs to run daily, hourly, weekly, or just about anything in between.
From there, Airflow will take care of running the jobs. There’s no need to go into cron to make updates on when scripts run; instead, the schedule lives as part of your code. This is beneficial both because you don’t have to hunt down wherever the scheduling agent lives and because the schedule gets versioned along with the rest of the pipeline.
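To make the cron idea concrete, here is a simplified, stdlib-only sketch of how a cron expression maps a timestamp to "should this run now?" This is an illustration of the concept, not Airflow’s actual scheduler, and it only handles `*` and plain numbers in each field.

```python
from datetime import datetime


def matches_cron(expr: str, dt: datetime) -> bool:
    """Return True if dt matches a simplified cron expression.

    Supports only '*' and plain numbers in the five fields:
    minute hour day-of-month month day-of-week (0 = Sunday).
    """
    fields = expr.split()
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))


# "0 6 * * *" means "run daily at 06:00"
print(matches_cron("0 6 * * *", datetime(2023, 5, 1, 6, 0)))  # True
print(matches_cron("0 6 * * *", datetime(2023, 5, 1, 7, 0)))  # False
```

In a DAG, you’d simply pass a string like `"0 6 * * *"` (or a preset like `"@daily"`) as the schedule, and Airflow’s scheduler does this matching for you.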
The ability to schedule jobs so easily was a major plus for me, having spent a lot of time at a prior job trying to figure out why SQL Server Agent wouldn’t work on my instance of SQL Server due to configuration problems.
Well, now my scheduler was bundled into one solution.