Data pipelines are a key component in any company’s data infrastructure.
One framework that many companies use to manage their data extracts and transforms is Airflow, whether they rely entirely on Airflow and its various operators or use it to orchestrate other components such as Airbyte and dbt.
For those unaware, Airflow was developed back in 2014 at Airbnb to help manage its ever-growing need for complex data pipelines. It rapidly gained popularity outside of Airbnb because it was open-source and met many of the needs data engineers had.
Now, nearly a decade later, many of us have started to see the cracks in its armor. We have seen its Airflow-isms (as Sarah Krasnik puts it).
In turn, this has led to many new frameworks in the Python data pipeline space, such as Mage, Prefect, and Dagster.
Many of us still rely on Airflow. But Airflow has its fair share of quirks and limitations, many of which don't become obvious until a team attempts to productionize and manage it in a far more demanding data culture.
So in this update, I wanted to talk about why data engineers love (and hate) Airflow. Of course, I didn't just want to state my own opinion. Instead, I interviewed several other data engineers to see what they like about Airflow and what they feel is lacking.
Where We Started
When I started interviewing other data engineers about which data pipeline solutions they began with, I assumed there would be some level of consistency in their answers. Perhaps most people would have started with SSIS, or with cron and bash scripts.
However, everyone started in a different place. Some started with Airflow, others with SSIS, and others with data integration systems I had never heard of. Here are a few quotes:
Joseph Machado - Senior Data Engineer @ LinkedIn - "I started running python scripts locally"
Sarah Krasnik - Founder of Versionable - “I’ve actually only ever used Airflow”
Matthew Weingarten - Senior Data Engineer @ Disney Streaming - “At my first company we used TIBCO as our enterprise data pipeline tool”
Mehdi (mehdio) Ouazza - mehdio DataTV - "Airflow was the first tool I used running an on-prem system"
It may feel like there are a lot of data pipeline solutions today. But there have always been a thousand different ways to transport data from point A to point B, whether through custom-developed infrastructure or no-code, drag-and-drop vendor options. Picking the best solution has always been a challenge.