Hello readers! Before diving into this week's newsletter, I wanted to give you a heads up about the next one, which will be out next week!
I have been working with a Machine Learning (ML) Engineer to understand how League of Legends deploys its ML models. If you enjoy LoL or ML, you're not going to want to miss out. Now back to our regularly scheduled programming.
Good Design Is Easier to Change Than Bad Design - The Pragmatic Programmer
Programming is just one aspect of the difficulties of tech work for data engineers. Creating simple yet robust systems that help manage your data infrastructure is equally important.
This challenge of building a simple yet robust data infrastructure remains even with no-code/low-code solutions. Just because you aren’t having to write as much code or use a CLI doesn’t mean you removed complexity.
In fact, these solutions may actually accelerate the creation of convoluted data pipelines and transforms. Designing simple and robust systems requires time, experience, and weighing trade-offs. Every design decision, whether a tool or standard, affects future data engineers and analysts. We are trying to take complex problems and make them easier to deal with.
Good design helps remove that complexity both now and in the future. But good design rarely happens immediately. Requirements change and team members leave. So how do you build systems that reflect good design?
In this article I wanted to discuss what can often cause complexity, what risks it poses and how we can improve the data infrastructure that we build.
How Do We Create Complexity In Our Data Infrastructure?
There are many ways complexity can be introduced into your data infrastructure.
But I wanted to focus on an example I ran into a few times at Facebook.
Most categorical data at Facebook wasn't stored in some form of local data structure (e.g. a dictionary or hash map) such that only the code that interacted with it would know what category was attached to what ID.
But there were a few instances where the engineer decided it was simpler to store the actual name of the category data in a local config and only pass the ID to the database.
Yes, this is “simpler to implement”, but on a system level this design decision has pushed complexity downstream.
In this example, I, as the data engineer, now had to do two things: create a second place where the logic mapping each ID to its name lived, and create a data quality check to ensure that every ID had a name in case new categories were added. (Yes, I talked with them and their team multiple times to change the situation.)
This is bad system design because:
❌ We have made change more difficult
❌ More individuals now need to be aware of said logic
❌ The data engineer might choose not to set up QA, which could then cause issues in the future
❌ Etc…
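To make the cost concrete, here is a minimal sketch of the kind of data quality check I described above. The mapping, the IDs, and the function names are all made up for illustration; the point is only that this check exists purely because the ID-to-name logic now lives in two places.

```python
# Hypothetical copy of the ID -> name mapping that the data engineer
# has to maintain in addition to the one in the application's local config.
CATEGORY_NAMES = {1: "sports", 2: "music", 3: "news"}


def find_unmapped_ids(observed_ids, id_to_name):
    """Return the sorted category IDs that appear in the data but have no name."""
    return sorted(set(observed_ids) - set(id_to_name))


def check_category_coverage(observed_ids, id_to_name):
    """Raise if any observed category ID is missing from the mapping."""
    missing = find_unmapped_ids(observed_ids, id_to_name)
    if missing:
        raise ValueError(f"Category IDs with no name: {missing}")


# IDs as they might arrive in the table; 4 is a category the upstream
# team added without telling anyone downstream.
rows = [1, 2, 2, 3, 4]
print(find_unmapped_ids(rows, CATEGORY_NAMES))
```

If the category names lived in a shared table instead, neither the second mapping nor this check would need to exist.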
Don't even get me started on what happens if the mapping isn't fixed at the table level and enters Excel VLOOKUP territory.
This is just one of many ways unnecessary complexity gets produced by making a bad design call.
What makes this worse is that it often compounds. Each bad design decision makes better design decisions harder later, as future engineers have to work around the limitations of the prior systems.
So how can we improve our future data infrastructure?
Good Systems Reduce Complexity
Well-designed data infrastructure reduces complexity because you take the time upfront to work through that complexity in your head, while you think about the design, rather than building it into your future infrastructure.
Here are a few tips on how you can reduce complexity in your data infrastructure.
Patterns And Standards
“Why is decoupling good? Because by isolating concerns we make each easier to change.
Why is the single responsibility principle useful? Because a change in requirements is mirrored by a change in just one module.
Why is naming important? Because good names make code easier to read, and you have to read it to change it.”
Every design choice and pattern has trade-offs. Even decoupling components has its trade-offs. But picking a set of design principles and patterns that will drive your data team's approach is a must, regardless of whether you are using custom code or low-code solutions.
Design patterns and standards help make code easy to work with and easy to change.
For example, at Facebook, one step our team took to make our data pipeline design easy for anyone to understand was to always put the same task at the very end of every pipeline. That task was the one below.
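The original snippet isn't reproduced in this text, so what follows is only a rough sketch of the idea in plain Python, not the internal Facebook API. The Task and Pipeline classes are assumptions I made up for illustration; only the ready name and the insert_raw / insert_prod naming come from this article.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """A unit of work plus the tasks it waits on (hypothetical class)."""
    name: str
    upstream: list = field(default_factory=list)


class Pipeline:
    """A hypothetical pipeline that always exposes the same final task."""

    def __init__(self, name):
        self.name = name
        self.tasks = []

    def add(self, task_name, depends_on=()):
        task = Task(f"{self.name}.{task_name}", list(depends_on))
        self.tasks.append(task)
        return task

    def ready(self):
        """Return a no-op final task that waits on every task added so far."""
        return Task(f"{self.name}.ready", list(self.tasks))


orders = Pipeline("orders")
raw = orders.add("insert_raw_orders")
orders.add("insert_prod_orders", depends_on=[raw])

# A new pipeline depends on orders.ready() without opening the orders file.
revenue = Pipeline("revenue")
revenue.add("insert_raw_revenue", depends_on=[orders.ready()])
```

Because every pipeline ends the same way, the dependency line in the new pipeline looks identical no matter which upstream pipeline it points at.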
Why?
Well now if you want to import the last task from this current data pipeline so your new data pipeline can depend on it, you don’t have to go into the file and figure out what it was.
The same design pattern makes it so you don’t need to be aware of anything that is going on in the actual pipeline. You just need to be aware of when it is ready().
Similarly, most tasks followed a naming convention where it was “insert_raw_<table_name>” or “insert_prod_<table_name>”.
On several occasions this naming convention allowed our team to change thousands of lines of code all at once via codemod because we could automatically run a change only on specific patterns.
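As a rough illustration of why the convention matters, here is a small sketch of a pattern-based rewrite. It is not the codemod tooling we used; real codemods usually work on the syntax tree rather than raw text, and the load_raw_ prefix and the task() helper in the sample are invented for the example.

```python
import re

# Matches task names that follow the insert_raw_<table_name> or
# insert_prod_<table_name> convention.
TASK_PATTERN = re.compile(r"\binsert_(raw|prod)_([a-z0-9_]+)\b")
RAW_PATTERN = re.compile(r"\binsert_raw_([a-z0-9_]+)\b")


def find_tasks(source):
    """Return (layer, table_name) pairs for every conforming task name."""
    return TASK_PATTERN.findall(source)


def rename_raw_tasks(source, new_prefix="load_raw_"):
    """Rewrite only insert_raw_* names, leaving insert_prod_* untouched."""
    return RAW_PATTERN.sub(lambda m: new_prefix + m.group(1), source)


# Hypothetical pipeline source; task() is a placeholder, not a real API.
code = '''
insert_raw_orders = task("insert_raw_orders")
insert_prod_orders = task("insert_prod_orders")
'''

print(rename_raw_tasks(code))
```

The rewrite is only safe because the names are predictable. Without the convention, there would be no pattern to target.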
Overall, what may feel like small changes are conscious decisions to make your data infrastructure easier to deal with.
Avoid Tunnel Vision
When given a task, it can be easy to become single-minded and focus only on the goal of said task. A great example is changing a column data type. It is a very simple task that at some companies can have drastic outcomes. That's why, when you are asked to change a data type, it can take a little longer than just running an ALTER statement on your table. The change could possibly break everything from dashboards to machine learning models, depending on how tightly coupled they are to data types.
Yet it's just too easy to get tunnel vision (especially early on in your career as a data engineer) and not think about the larger impact your data change can have on a system.
As referenced earlier, it is simpler for a SWE to insert a local enumerator than to build out a table or configuration that can then be accessed by others. So for their specific task, they picked a simple solution. But it was only simple from their perspective. Once you look at the rest of the system's design, it's now hidden tech debt.
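One way to push back against tunnel vision is to list what sits downstream before touching the column. The sketch below assumes you have some form of lineage information available; the LINEAGE dictionary and the asset names in it are invented for illustration.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed in its value.
LINEAGE = {
    "prod.orders": ["dash.daily_revenue", "prod.order_features"],
    "prod.order_features": ["ml.churn_model"],
    "dash.daily_revenue": [],
    "ml.churn_model": [],
}


def downstream_of(asset, lineage):
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = {asset}
    queue = deque([asset])
    affected = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Everything that could break if a column type in prod.orders changes.
print(downstream_of("prod.orders", LINEAGE))
```

Even a crude list like this turns "just run the ALTER" into a conversation with the owners of the dashboard and the model.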
It impacts both future engineers who need to maintain the code as well as downstream users of the data.
Iteration
Invent And Simplify - Amazon
Creating simplified processes and systems doesn't happen overnight. On some occasions it happens over several iterations of developing a system. In fact, the joke at Facebook was that in order to get promoted you would first solve a complex problem with a complex system, get a promotion, and then spend the next half simplifying it to get another promotion. Joke or not, this does align with Amazon's "invent and simplify" notion. The first pass of a design or implementation may be good, but with time developers will realize they could simplify the process.
This is how our team slowly went from over 150 lines per pipeline developed to around 40-50. Every 3-6 months someone else came in and added in a new simplification, interface or template that helped streamline the process.
In the end, not all good system design will occur immediately, but as your data infrastructure is forced to change, your team can continually look for opportunities to improve it.
Build For Change
One of the truths you get accustomed to when working in enterprises, in big tech, and as a consultant is that change is unavoidable, whether it occurs now because management decides they want new features or in the future when a dependency is sunset.
Change will happen.
For example, I have recently been working on a project where the client was forced to migrate multiple components, including their CRM, data warehouse, and data pipeline, all in one go. Everything was tightly coupled, so when a vendor sunset one piece, all of them were impacted. In turn, this led to a multimillion-dollar migration project that carried huge risks if it failed. If your data infrastructure is not built for change, this is unavoidable.
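One common way to loosen that kind of coupling is to have pipeline code depend on a small interface rather than on a specific vendor's client. The sketch below is only an illustration of that idea, not how the client's system worked; the Warehouse protocol, the InMemoryWarehouse stand-in, and run_pipeline are all hypothetical names.

```python
from typing import Callable, Protocol


class Warehouse(Protocol):
    """The only thing the pipeline is allowed to know about a warehouse."""

    def load(self, table: str, rows: list) -> int: ...


class InMemoryWarehouse:
    """Stand-in for a vendor-specific client (hypothetical)."""

    def __init__(self):
        self.tables = {}

    def load(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)


def run_pipeline(extract: Callable[[], list], warehouse: Warehouse, table: str) -> int:
    """Move rows from a source into whatever warehouse was passed in."""
    rows = extract()
    return warehouse.load(table, rows)


warehouse = InMemoryWarehouse()
loaded = run_pipeline(lambda: [{"id": 1}, {"id": 2}], warehouse, "crm_contacts")
print(loaded)
```

When a vendor is sunset, the replacement only has to provide the same load method; run_pipeline and everything built on top of it stay as they are. The trade-off is one more layer to maintain.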
Products and solutions sunset. Old versions of programming languages eventually stop being supported.
We can pretend that we will build systems that will never change, but they will.
The Only Constant Is Change
Good design is hard because it can feel easier to put off even basic design decisions until later. It takes time, discipline, and experience to know how to design solutions that can change when they need to. To add to this problem, as data engineers we tend to be the bottleneck that has to deal with both software and end-user needs. Our data infrastructure constantly needs to adapt, so it must be built to do so.
Building any system is hard, but it’s far from impossible. Good practices, good design and collaboration with other teams can help make it easier.
What are your thoughts on well designed data infrastructure?
Seattle Data Guy Events!
Sign Up For Data Engineering Things And SDG Data Conference - Late October - Virtual
Data Vault User Group: Business Modeling to Data Vault Modeling - Virtual
Jobs
Data & Analytics Engineer - Gorgias
Sr. Data Engineer - Anaconda
Data Engineer - Local Logic
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 51k and my email newsletter has grown from 2k to well over 31k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will in turn play a role in what channels, categories, and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that's just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
TBM 53/52: "Are All Companies Messed Up?"
“Are all companies messed up?”
Let’s explore that question.
I have found the words "chronic" and "acute" to be helpful when thinking about the health of a company, and the relative messed-up-edness of companies. Like any words used to describe organisms or ecosystems, there are drawbacks (most notably taking the comparisons too far), but there is value here.
End Of Day 66
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.