As you start building your first data pipelines, you’ll slowly realize you need to address a growing number of recurring issues. Maybe you implement a component or process that tracks what jobs are running, a scheduler, a set of generic scripts to run transforms and data ingestion, or even some form of UI.
Before you know it, you’ve pieced together something that looks like Airflow. Something that goes beyond just being a set of data pipelines but starts looking like an orchestrator.
Surprisingly (or maybe not), I’ve seen countless homegrown orchestration/data pipeline systems. Often, it feels like, given enough time, the team might build its own Airflow-esque solution.
In fact, I’ve come across plenty of posts where people ask if they should just build an orchestrator in-house.
That got me thinking: What does it really take to build a custom orchestrator and data workflow system? And should you even bother?
Here’s my take on what it involves, and a few examples from my past experiences.
Examples of Data Workflow Systems I’ve Encountered
I’ve been fortunate to see a wide range of approaches to data pipelines and orchestration systems. Many teams don’t even have what you’d call a “workflow orchestrator.” Instead, it’s usually more of a script that calls a bunch of other scripts in a specific order.
So this would likely fit under the umbrella of a data pipeline or ETL/ELT system.
But let’s look at some real-world systems.
One common example of a data pipeline system is the standard script called by a scheduler. On Linux that’s cron; on Windows it’s Task Scheduler. These systems often have a central script that drives the rest of the system. One particular instance I encountered relied on PowerShell instead of Python for the control layer, but they all have similar components.
Here’s a breakdown of its key components:
Transform Running Module
Logging
Data Quality Module
Metadata Tracking
One example of a system I have seen looked roughly like this.
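To make that concrete, here is a minimal Python sketch of what that kind of cron-driven driver script often looks like. Everything in it is a hypothetical stand-in: the transforms/ folder of SQL files, the SQLite “warehouse”, and the job_runs table are only there to illustrate the transform running, logging, data quality, and metadata components listed above, not a real production system.

```python
# driver.py - a hypothetical cron-driven "main" script that strings the
# components above together; names and structure are illustrative only.
import glob
import logging
import sqlite3
from datetime import datetime
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")


def run_transform(sql_path: str, conn: sqlite3.Connection) -> None:
    """Transform running module: execute a .sql file against the warehouse."""
    with open(sql_path) as f:
        conn.executescript(f.read())


def check_row_count(table: str, conn: sqlite3.Connection, minimum: int = 1) -> bool:
    """Data quality module: a bare-minimum check that the table isn't empty."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count >= minimum


def record_run(job: str, status: str, conn: sqlite3.Connection) -> None:
    """Metadata tracking: write job name, status, and timestamp to a run-history table."""
    conn.execute(
        "INSERT INTO job_runs (job, status, ran_at) VALUES (?, ?, ?)",
        (job, status, datetime.utcnow().isoformat()),
    )
    conn.commit()


def main() -> None:
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS job_runs (job TEXT, status TEXT, ran_at TEXT)")
    # "Just add a new file to the folder" is the extension mechanism: every
    # .sql file in transforms/ gets picked up and run in alphabetical order.
    for sql_path in sorted(glob.glob("transforms/*.sql")):
        table = Path(sql_path).stem  # assumes each script (re)builds a table of the same name
        try:
            run_transform(sql_path, conn)
            if not check_row_count(table, conn):
                raise ValueError(f"{table} is empty after the transform ran")
            record_run(table, "success", conn)
            logging.info("Finished %s", sql_path)
        except Exception:
            record_run(table, "failed", conn)
            logging.exception("Transform %s failed", sql_path)


if __name__ == "__main__":
    main()
```

A single cron entry (or Task Scheduler job) then runs driver.py on whatever cadence the team needs.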
This kind of setup is rigid, doesn’t necessarily track dependencies, and makes future changes less dynamic. Although it does at least let you drop new files into a folder to extend the system, which is a step in the right direction. Truth be told, I’ve seen even more rigid versions where every data connector is written as its own method and everything is tightly coupled. It’s not exactly dynamic.
On the other end of the spectrum, companies like Airbnb, Facebook, and Spotify built their own orchestrators. Now there is a delineation between orchestrators and data pipeline systems. But functionally many companies will use these orchestrators to manage their data workflows.
If you haven’t seen Dataswarm, it’s very similar to Airflow; you can get a sense of the style from the example below.
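Dataswarm itself is internal to Facebook, so as a rough stand-in, here is what the same idea looks like as a minimal Airflow-style DAG: tasks declared in Python, a schedule, and dependencies expressed directly between tasks. The DAG name and callables are made up for illustration, and argument names can vary slightly between Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def transform():
    print("build the reporting tables")


with DAG(
    dag_id="daily_sales",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # schedule_interval in older Airflow versions
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Dependencies are declared explicitly, which is a big part of what
    # separates an orchestrator from a script that calls other scripts in order.
    extract_task >> transform_task
```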
These are built to be generalized so that any company can use them, covering key orchestrator facets like scheduling, dependency management, and UI. We’ll dive into these functionalities in more detail later.
Not to mention there is a broad set of alternatives to Airflow these days, ranging from similar frameworks to low-code/no-code options. The challenge comes when you realize that some of these alternatives are actually workflow orchestration solutions while others are purely data pipeline solutions. This becomes apparent when you start trying to do work outside of the data warehouse/data lake.
For example, you might want to push a file somewhere, manage external data connectors/transform solutions, or refresh a data source for Tableau. Some of the tools on the spectrum will support that; others won’t. Although I am starting to wonder whether the image below should be more of a quadrant than a spectrum.
After all, there is such a broad range of options. You might want to use a tool that is low-code/no-code or one that is just a Python library.
When Should You Build Your Own Orchestrator?
Let’s get to the part where I tell you if you should build your own orchestrator system. Obviously, there is a lot of nuance there. On the basic end, you might have a few Python scripts that handle extraction, transformation, and loading (ETL), all wrapped in a main script.
Since you only need a simple solution, you might be able to get away with creating a basic data pipeline system. That isn’t an orchestrator, just a system that automates your basic data tasks.
On the flip side, you might be aiming to build a highly generic system that functions as the orchestrator itself. This is where you need to pause and ask yourself: Do you really want to reinvent Airflow? If you decide to pursue a generic solution, there are a few key considerations:
Vision: If you plan to build a new orchestrator, I believe you need to pause and ask yourself a few questions. What will set your orchestrator apart from the many others out there? How will your solution be better? How are you approaching the problem of managing and automating tasks, specifically data tasks?
Design: You’ll face numerous design decisions from the outset. How will you handle dependencies? What’s your approach for loading new pipelines (a rough sketch of one approach follows this list)? If you start building your orchestrator without addressing these fundamental questions, you risk ending up with a solution that’s less effective than any of the existing data workflow orchestrators. Take the time to study past solutions: identify what worked, what didn’t, and why. Understand the core challenges of orchestration, then think critically about how you can design something better.
Overhead: Building a successful orchestration solution is rarely a one-person job. While it might seem straightforward to replace all the functionality, even established orchestrators with large teams struggle to innovate and improve. Unless you’ve discovered a truly unique approach, it’s unlikely that a single engineer will be able to build a better solution than what’s already available.
Adoption: Being a good engineer won’t be enough—you’ll also need to be a good marketer. Convincing engineers or other professionals to switch from familiar tools is like pulling teeth (just think of how entrenched Excel is). Without a solid go-to-market (GTM) plan, even the most well-designed generic orchestrator will struggle to gain traction and may go largely unnoticed. Even at your own company.
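To make the “loading new pipelines” question from the Design point concrete, here is one rough approach a homegrown system might take: scan a folder for pipeline definition files and register whatever it finds. The folder layout and the PIPELINE dictionary format are assumptions for the sake of illustration, not a recommendation.

```python
# Hypothetical "load pipelines from a folder" approach: every .py file in
# pipelines/ is expected to define a PIPELINE dict with a name, a schedule,
# and an ordered list of tasks.
import importlib.util
from pathlib import Path


def discover_pipelines(folder: str = "pipelines") -> dict:
    """Import each module in the folder and collect its PIPELINE definition."""
    registry = {}
    for path in sorted(Path(folder).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        pipeline = getattr(module, "PIPELINE", None)  # skip files without one
        if pipeline:
            registry[pipeline["name"]] = pipeline
    return registry


# Example pipeline file (pipelines/daily_sales.py):
# PIPELINE = {
#     "name": "daily_sales",
#     "schedule": "0 6 * * *",  # cron expression
#     "tasks": ["extract", "transform", "load"],
# }

if __name__ == "__main__":
    for name, pipeline in discover_pipelines().items():
        print(name, pipeline["schedule"], pipeline["tasks"])
```

Deciding questions like this up front, how pipelines are declared, discovered, and versioned, is exactly the design work that is easy to skip and painful to retrofit.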
It’s important to note that I don’t mean to discourage you. However, if you plan on building your own orchestration layer, I hope you’ll take the points above into serious consideration.
Functionality Required For An Orchestrator
If you’re still hell-bent on building your own orchestrator and want it to serve as a general-purpose solution, here are some essential features to consider:
Backfill: The word “backfill” can send shivers down the spine of most data engineers. Much of this depends on how your orchestrator handles the task. You may need to re-run thousands of jobs, which can be time-consuming and error-prone. During my time at Facebook, I had to do this many times, and things didn’t always go smoothly. Streamlining this process is crucial for a great orchestrator (see the sketch after this list).
Dependency Management: Even modern orchestrators, especially piecemeal solutions cobbled together from various tools, struggle with tracking which tasks have run and what should come next. Whether you’re dealing with real-time events or batch processes, your system needs to manage dependencies smoothly. Otherwise, you may end up with a tangled web of scripts that are difficult to maintain.
Scheduling: In my three years at Facebook, daylight savings often caused scheduling headaches. Many jobs would end up running twice for the same hour, which in some cases caused issues. While our team took measures to avoid problems, other teams had to adjust their pipelines. The reality is that something has to handle your task scheduling: will it be cron or a more sophisticated solution? Whatever you choose, make sure it can handle these kinds of quirks without disrupting your workflows.
Metadata Management: There will be a lot of metadata that needs to be tracked—not just job dependencies, but also which jobs have been completed, their execution times, any errors they encounter, the types of errors, and job ownership. And that’s just the start. All this information needs to be organized and easily accessible.
Integration: There are plenty of great tools for handling transformations, extraction, data quality, and more. Your orchestrator needs to integrate well with these tools. Even Airflow has challenges when running dbt (and, of course, now dbt aims to be an orchestrator itself), so make integration a priority.
Alerting and Logging: Alerting systems often swing between being too sensitive and not sensitive enough, sometimes both at once. AI could potentially help address the problem of alert fatigue, where too many alerts cause important ones to be ignored. Logging, on the other hand, is generally more straightforward. As long as your system allows you to review past events and customize log messages, you should have what you need.
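To tie a few of these together, here is a deliberately tiny sketch of the core logic an orchestrator has to get right: resolving task dependencies into a run order, backfilling over a range of dates, and recording run metadata. The task names and the in-memory run history are assumptions for illustration; a real system would need a database, retries, timezone-aware scheduling (running on UTC helps with the daylight savings issue above), alerting, and much more.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dependency management: each task maps to the tasks it depends on.
TASKS = {
    "extract_orders": [],
    "extract_customers": [],
    "build_sales_fact": ["extract_orders", "extract_customers"],
    "refresh_dashboard": ["build_sales_fact"],
}

# Metadata management: in a real orchestrator this would be a database table
# tracking status, duration, errors, and ownership for every run.
run_history: list[dict] = []


def run_task(task: str, run_date: date) -> None:
    """Placeholder for the actual work (SQL, API calls, file pushes, etc.)."""
    print(f"running {task} for {run_date}")


def run_for_date(run_date: date) -> None:
    """Run every task once for a single execution date, in dependency order."""
    for task in TopologicalSorter(TASKS).static_order():
        run_task(task, run_date)
        run_history.append({"task": task, "run_date": run_date, "status": "success"})


def backfill(start: date, end: date) -> None:
    """Backfill: re-run the whole graph for every date in the range."""
    current = start
    while current <= end:
        run_for_date(current)
        current += timedelta(days=1)


if __name__ == "__main__":
    backfill(date(2024, 1, 1), date(2024, 1, 3))
```

Even this toy version hides the hard parts: partial failures mid-graph, re-running only what failed, and keeping thousands of backfill runs from hammering your warehouse all at once.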
A Few Other Considerations
The above, in my world, would be considered the baseline. You should also consider how you’d handle the following:
Data Governance
Data Lineage
Ease of managing different environments (Dev, Test, Production)
Testing
CI/CD
When you first start building a data workflow orchestration system, it might seem straightforward. However, as you add more functionality and scale, you’ll quickly realize that it’s much more complex than simply organizing a few scripts to run at specific times.
Note: Do let me know if you feel there are other must-haves for a workflow orchestrator.
So Should You Build Your Own?
Whenever I see someone start building their own data pipeline system, I can’t help but feel a bit concerned. Don’t get me wrong, if you keep it simple, things will run smoothly for a while. But over time, as you build more and more pipelines, you’ll realize that a collection of scripts calling scripts quickly falls short. You’ll end up needing a generic logging system, a database connector, an API connector, and more.
And don’t forget the inevitable schema changes and a whole host of other challenges that come with scaling.
I’m not here to completely discourage anyone from building their own solution—just make sure you’ve taken a hard look at all the available options and weighed the costs and benefits carefully.
I believe the comment below sums up my thoughts pretty well. If you do plan to go down this route, don’t just think about the surface-level issues you’re facing with your current orchestrator. There is likely a reason why these solutions made certain design decisions.
So if you go a few steps deeper, really analyze the problem space, see something that is worth reconsidering, and you want to commit the next decade to building out your solution, then I’d say go for it!
With that, thanks for reading!
Join My Data Engineering And Data Science Discord
If you’re looking to talk more about data engineering, data science, breaking into your first job, and finding other like-minded data specialists, then you should join the Seattle Data Guy Discord!
We are now well over 7000 members!
Join My Technical Consultants Community
If you’re a data consultant or considering becoming one, then you should join the Technical Freelancer Community! We have over 700 members!
You’ll find plenty of free resources you can access to expedite your journey as a technical consultant as well as be able to talk to other consultants about questions you may have!
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles as well as TechCrunch and company tech blogs and wanted to share some of my favorites!
Genie: Uber’s Gen AI On-Call Copilot
In today’s fast-paced tech environment (Tell me you used ChatGPT to help write this article without telling me you used ChatGPT to help write this article), maintaining robust on-call operations is crucial for ensuring seamless service functioning. Modern platform engineering teams face the challenge of efficiently managing on-call schedules, incident response, communication during critical moments, and strong customer support on Slack® channels.
At Uber, different teams like the Michelangelo team have Slack support channels where their internal users can ask for help. People ask around 45,000 questions on these channels each month, as shown in Figure 1. High question volumes and long response wait times reduce productivity for users and on-call engineers.
End Of Day 148
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.