Will Airflow Win The Orchestration Race?
Discussing Astronomer's Recent $213 Million raise and Spotify replacing Luigi
There is a lot happening in the world of orchestrators this week, from Luigi being officially replaced at Spotify to multi-million-dollar raises and acquisitions.
Last week I was at Data Council in Austin, along with the rest of data Twitter. It was great to finally meet people in person whom, in some cases, I had been talking to virtually for years.
Besides riveting conversations on how to set up data stacks, most of my discussions eventually covered a few things: Astronomer's recent purchase of Datakin and which company did best at cornhole at Lauren's Modern Data Blowout.
But back to the orchestrators.
In case you missed it, Astronomer raised $213 million a few weeks back. This was led by Insight Ventures and was joined by Meritech Capital, Salesforce Ventures, J.P. Morgan, K5 Global, Sutter Hill Ventures, Venrock, and Sierra Ventures.
But billion-dollar valuations are nothing new in the data space at this point. What I found more interesting with this raise was the acquisition of Datakin.
What is Datakin you ask?
Datakin helps data engineers and data scientists better track the lineage of their data pipelines. Put another way, Datakin helps data engineers trace relationships between datasets, so when a number looks off in a dashboard or a pipeline step fails, you can find out why fast.
This move makes a lot of sense as Astronomer looks to grow its offerings into much more than just managed Airflow.
As Astronomer’s CEO Joe Otto put it: “For the last couple of years, we focused on Airflow and working with the people who created it. Now we are working with them to take Airflow to the next level. We’ve learned how companies are using it, and we are getting ready to launch a product and start scaling field teams, so there is a big opportunity out there.”
There seem to be clear goals here to take Astronomer and build a lot more functionality on top of a managed Airflow service.
Perhaps re-bundling will be the way of the future for data infrastructure. Of course, this might be a little difficult with most of the top players being valued at billions of dollars. But, I can foresee the need for tool consolidation not only for the sanity of us data engineers but also to increase the distribution of tools that are more difficult to sell as a one-off solution.
Truthfully, I believe that the data space is going to get a lot more interesting in the next 2-3 years.
Let’s shift away from Airflow because, on the flip side, Luigi, the orchestration tool maintained by Spotify, seems to have been fully replaced by Flyte, according to a recent article on Spotify's engineering blog.
Here are some of the reasons they listed for deciding to go with Flyte:
Has a similar entity model and nomenclature as Luigi and Flo, making the user experience and migration easier.
Uses Task as a first-class citizen, making it easy for engineers to share and reuse tasks/workflows.
Has a thin client SDK by moving more to the backend: this makes the maintenance of the overall platform much simpler than our existing offering with two libraries (Python, Java) both holding the logic.
Is decoupled from the rest of our ecosystem, enabling mixing and matching in the service layer according to our strategy.
Etc.
I do love that engineering teams are so transparent about their technology these days as it’s great to see what is occurring under the hood and why.
Since Luigi has often been compared to Airflow, it does seem like Airflow might have won this battle.
Of course, Airflow is now facing at least two other competitors that are picking up steam.
Prefect and Dagster are two other well-funded start-ups in the data space. For those who haven’t worked with or heard of these tools, here is a quick overview.
Dagster
Dagster is developed by Elementl, which was founded by Nick Schrock. The goal of Dagster is to take a different approach to data pipelines compared to Airflow and some other orchestrators. Instead of a jobs- or tasks-based approach, Dagster is putting together a software-defined approach to data pipelines.
It does this through software-defined assets. This concept starts by using code to define the data assets that you want to exist. These asset definitions are version-controlled through git and inspectable via tooling. In turn, they allow anyone in your organization to understand your canonical set of data assets and allow you to reproduce them at any time. Overall, this approach offers a foundation for asset-based orchestration.
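To make the asset-based idea concrete, here is a minimal plain-Python sketch (this is an illustration, not Dagster's actual API): each decorated function declares a dataset that should exist, dependencies are inferred from parameter names, and a small materializer rebuilds assets in dependency order.

```python
# Toy sketch of asset-based orchestration (hypothetical; not Dagster's real API).
import inspect

ASSETS = {}  # asset name -> the function that computes it


def asset(fn):
    """Register a function as the definition of a data asset."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize(name, cache=None):
    """Compute an asset, first materializing the upstream assets it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        # Dependencies are inferred from the function's parameter names.
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


@asset
def raw_orders():
    return [{"amount": 10}, {"amount": 25}]


@asset
def order_totals(raw_orders):
    # Depends on raw_orders simply by naming it as a parameter.
    return sum(o["amount"] for o in raw_orders)


print(materialize("order_totals"))  # 35
```

The point of the pattern is that the code describes the assets you want to exist, and the orchestrator figures out what to run to produce them, rather than you wiring up tasks by hand.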
By changing the approach and focus of the orchestrator, Dagster provides data engineers and data scientists several benefits, including more classic software testability, subset execution, the ability to have multiple schedules, and more. In the end, Dagster does offer a lot in terms of answering some of the problems developers face with Airflow, and we will see how it continues to grow over the next few years.
Prefect
Founded in 2018 by Jeremiah Lowin, Prefect is also an open-source modern workflow management tool designed to orchestrate data stacks by building, running, and monitoring data pipelines.
In comparison to Airflow, Prefect treats workflows as standalone objects that can be run at any time, with any concurrency, for any reason. Another benefit is that Prefect elevates dataflow to a first-class operation: tasks can receive inputs and return outputs, and Prefect transparently manages these dependencies. In more orchestration news, Prefect launched Prefect 2.0 two weeks ago, so it's worth reading more about that as well.
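To illustrate what "dataflow as a first-class operation" means, here is a plain-Python sketch loosely modeled on Prefect's decorator style (the decorator here is a toy stand-in, not Prefect's actual implementation): task outputs flow directly into downstream tasks as inputs, with no XCom-style side channel, and the workflow itself is just a function.

```python
# Toy sketch of first-class dataflow (hypothetical; not Prefect's real API).
import functools


def task(fn):
    """Toy stand-in for a task decorator: wraps a plain function and logs
    its result as it flows downstream."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        print(f"task {fn.__name__} -> {result!r}")
        return result
    return wrapper


@task
def extract():
    return [1, 2, 3]


@task
def transform(rows):
    return [r * 10 for r in rows]


def pipeline():
    # transform() consumes extract()'s return value directly; the
    # dependency between the two tasks *is* the data handoff.
    return transform(extract())


print(pipeline())  # [10, 20, 30]
```

Contrast this with classic Airflow, where operators are scheduled as a DAG of tasks and values are typically passed between them through a side mechanism (XComs) rather than as ordinary function inputs and outputs.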
All that being said…
Airflow is still a heavy favorite, at least according to the poll I recently ran on my YouTube and Twitter, which showed Airflow to be the preferred option.
Airflow does have a solid community so both Prefect and Dagster have their work cut out for them in terms of getting adoption by heads of data as they start their next data projects.
How Will You Orchestrate?
A lot has happened in the world of workflow orchestration in the last few weeks and we will likely continue to see these companies grow and challenge each other. Not just in terms of product but philosophy.
This will likely lead to multiple winners at the end of the day. Data engineers who prefer thinking about workflows as tasks might stick with Airflow, and those who like the idea of software-defined assets will likely push more into Dagster.
All in all, adoption and distribution are not the only concerns these start-ups will have this year. In 2022 we are going into a higher-risk environment. War, inflation, rising rates, and general uncertainty are at an all-time high. Whether you are investing in tickers or private companies, a flight to quality will occur.
So, where do you lean?
Video Of The Week: Why Everyone Cares About Snowflake
Scrolling through Linkedin, I continue to find other data engineers who echo the same sentiment I have when it comes to cloud data warehouses.
That is to say, Snowflake and BigQuery are dominating the market. But here is the thing: according to Slintel, Amazon Redshift still leads the market with a 26.36% share (of course, it doesn't take much searching to find completely different numbers and rankings on other sites for the same few products).
So perhaps that point is moot.
What isn’t moot is the underlying current among data professionals who have had a chance to use Snowflake.
Many seem to enjoy it. The question is: why?
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 24k subscribers and my email newsletter has grown from 2k to well over 6k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will, in turn, play a role in what channels, categories, and support are created in the future.
No Sponsor this week but…I Am Doing A Talk
Why We Build Reliable Data Systems
Thursday, March 31 at 10 a.m. PST
Abstract
Building systems that are reliable is far from a new concept. Long before data-driven was a buzzword, engineers were building bridges, nuclear reactors, and chemical plants, some of which ended up as asterisks in the history books thanks to cost-cutting around quality control, maintenance, and design for reliability.
Data systems are no different. The failure of a data system may not have as devastating an effect as a failure in a chemical plant, but it can lead to bad decisions, incorrect billing, and flawed strategies. The fact of the matter is that data observability shouldn't be an afterthought because of time or budget constraints. Instead, it should be part of the process.
Data quality and reliability ensure that the data products companies build and the decisions they make aren't just reliable when a system is first implemented, but long after. In this discussion you will see how end-to-end monitoring, alerts, and lineage can not only prevent errors from occurring but also accelerate root-cause analysis and incident resolution when they do.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that's just Medium! I have spent a lot of time sifting through these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Riding the AI Wave
It’s been almost a decade since Marc Andreessen declared that software was eating the world and, in tune with that, many enterprises have now embraced agile software engineering and turned it into a core competency within their organization. Once-‘slow’ enterprises have managed to introduce agile development teams successfully, with those teams decoupling themselves from the complexity of operational data stores, legacy systems, and third-party data products by interacting ‘as-a-service’ via APIs or event-based interfaces. These teams can instead focus on delivering solutions that support business requirements and outcomes, seemingly having overcome their data challenges.
The Future of AI Infrastructure is Becoming Modular
When people think of AI and machine learning, self-driving cars, robots or supercomputers often spring to mind. But in reality, the AI use cases that are driving business results aren’t that sexy–at least not in the conventional sense.
For a while now, we at Sapphire have been very excited about what Databricks CEO, Ali Ghodsi, has dubbed “Boring AI”–using AI to drive tangible business value through reduced costs, increased revenue, improved human productivity and more.
Through the power of AI, marketers are gleaning greater customer insights and recommendations, manufacturers are predicting supply chain bottlenecks and insurance companies are more accurately assessing risks. It’s why the market for enterprise AI is growing at 35% annually and is expected to reach nearly $53B by 2026. It’s no wonder that positions for AI specialists and data scientists (a precursor to AI) are up 74% and 37%, respectively.
99% to 99.9% SLO: High Performance Kubernetes Control Plane at Pinterest
Over the past three years, the Cloud Runtime team’s journey has gone from “Why Kubernetes?” to “How to scale?”. There is no doubt that the Kubernetes-based compute platform has achieved huge success at Pinterest. We have been supporting big data processing, machine learning, distributed training, workflow engines, CI/CD, and internal tools, backing up every engineer at Pinterest.
Why Control Plane Latency Matters
As more and more business-critical workloads onboard Kubernetes, it is increasingly important to have a high-performance control plane that efficiently orchestrates every workload. Critical workloads such as content model training and ads reporting pipelines will be delayed if it takes too long to translate user workloads into Kubernetes native pods.
End Of Day 38
Thanks for checking out our community. We put out 3-4 newsletters a week discussing data, tech, and start-ups.
If you want to learn more, then sign up today. It's free to keep getting these newsletters.