The Challenge Of Working In Data
“Why do we still struggle to answer basic questions like how many customers are still active or what exactly is the company’s churn?”
This was a question posed to a panel I was sitting on at the Snowflake Summit. Several panelists provided different answers, all coming from very different perspectives. Truth be told, I have mulled over this question for the last few years as I worked on projects that involved reconciling basic numbers like the number of active customers or total sales for a company.
These are metrics most would assume are easy to calculate, yet they continue to plague companies even after those companies have spent large amounts of their budget trying to build robust data infrastructure.
Rarely is this due to one cause. Over the years I have seen a multitude of reasons why companies struggle to answer basic questions. These reasons include constant turnover of developers, ERP and CRM migrations, producers of data constantly changing what data they provide, and mergers and acquisitions (and of course there are more). All in all, continually reporting even basic numbers can prove very challenging when the underlying components are constantly shifting. Let’s discuss some of these issues and how we can try to mitigate them.
Developers Want To Develop
When you hire a developer, they generally want to develop. This means that if you put them in a role where they are stuck maintaining old code or using a pre-built solution, they will become frustrated.
I know, I have been there. Developers want to build new things and every problem is an opportunity to do so.
Taking away the opportunity for developers to build is taking away what they find joy in doing. That drive to build is great when there are complex problems to solve: your developers can build a solution that helps manage a complex workflow.
However, then the team or individual leaves and no one knows how, or wants, to maintain the system they built. So the cycle starts again. A developer joins the company, decides to build new infrastructure, perhaps gets a few metrics down, and then leaves 18-24 months later.
This leads to a repetitive cycle of developing and redeveloping the same infrastructure. The pattern generally occurs in small and medium-sized companies, where data teams are small and a process is never fully developed.
One solution is to provide a clear framework that future engineers need to adhere to. Although every engineer has their preference of tools and best practices, it would be a good idea for a company to set standards and provide a framework that future engineers, when hired, can fit into.
Otherwise the cycle will continue.
Software Engineers Don’t Like To Be Slowed Down
If you’ve worked as a data engineer in a product-heavy company, you have probably had to keep very close track of your application source tables, because software engineers move fast and care about functionality first and data second... or maybe third.
This often leads to a “throwing over the fence” (Chad Sanderson) mentality from the producers.
In turn this causes several problems.
Data engineers have to build very brittle logic into their pipelines in order to manage any issues or limitations the data has.
Data engineers need to create systems, or manually check tables daily, to catch any changes that will break their pipelines.
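As a rough illustration of the second problem, here is a minimal sketch of the kind of check a data team might run against a source table each day. It assumes schema snapshots are available as plain dictionaries of column name to type; the function name diff_schema, the snapshot format, and the sample columns are all hypothetical and not tied to any particular tool.

```python
from typing import Dict, List

# Hypothetical snapshot format: column name -> declared type.
Schema = Dict[str, str]


def diff_schema(previous: Schema, current: Schema) -> Dict[str, List[str]]:
    """Compare two snapshots of a table's schema and report what changed."""
    removed = sorted(col for col in previous if col not in current)
    added = sorted(col for col in current if col not in previous)
    retyped = sorted(
        col for col in previous
        if col in current and previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Illustrative snapshots of an application table on two consecutive days.
yesterday = {"customer_id": "int", "email": "text", "signup_date": "date"}
today = {"customer_id": "text", "email": "text", "created_at": "timestamp"}

changes = diff_schema(yesterday, today)
print(changes)
```

A removed or retyped column is usually what breaks a downstream pipeline, so a check like this could alert the team before the next scheduled load rather than after it fails.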
One solution to this problem, supported by several data experts, is the concept of a data contract.
“A data contract is a written agreement between the owner of a source system and the team ingesting data from that system for use in a data pipeline. The contract should state what data is being extracted, via what method (full, incremental), how often, as well as who (person, team) are the contacts for both the source system and the ingestion.”
— Data Pipelines Pocket Reference, By James Densmore (O’REILLY 2021)
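The quote describes a written agreement, but the same fields can also be captured in code so they can be checked automatically. The sketch below is one possible shape, not a standard: the DataContract class, its field names, the violations helper, and the example contacts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class DataContract:
    """Hypothetical in-code version of the fields named in the quote."""
    dataset: str               # what data is being extracted
    columns: Dict[str, type]   # expected fields and their Python types
    method: str                # "full" or "incremental"
    frequency: str             # how often, e.g. "daily"
    source_contact: str        # owner of the source system
    ingestion_contact: str     # owner of the ingestion


def violations(contract: DataContract, record: Dict[str, Any]) -> List[str]:
    """Return a list of ways a single record breaks the contract."""
    problems = []
    for name, expected in contract.columns.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(
                f"wrong type for {name}: expected {expected.__name__}"
            )
    return problems


# Illustrative contract; dataset, columns, and contacts are made up.
orders_contract = DataContract(
    dataset="orders",
    columns={"order_id": int, "amount": float},
    method="incremental",
    frequency="daily",
    source_contact="app-team@example.com",
    ingestion_contact="data-team@example.com",
)

good_record = {"order_id": 1, "amount": 19.99}
bad_record = {"order_id": "1"}

print(violations(orders_contract, good_record))
print(violations(orders_contract, bad_record))
```

Keeping the contacts in the contract itself means that when a check fails, the pipeline already knows which team to notify.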
On one side this is great: it provides better collaboration and opportunities for transparency.
On the other, I can understand why both the software and data engineering teams will feel like this slows down their work.
However, I believe this is a happy middle ground: instead of involving an extra person to try to manage all of the changes, data engineers and software engineers can manage the changes themselves through a clear process.
Migrating ERPs And CRMs
Companies are constantly switching out their underlying CRMs, ERPs, and their data infrastructure (such as the data pipelines and data warehouses). This is driven by the fact that solutions get sunset and companies are trying to improve performance or reduce costs.
All of these have the same result: the need to migrate data pipelines to a new source. Of course, it’s not that simple. The new source will also have a different underlying data model and will likely require new business logic to translate it.
Constantly switching out the underlying infrastructure doesn’t just slow down current initiatives for a data team, it generally sets them back a few steps.
There are steps companies can take to reduce the impact of migrating data sources, such as including the data team in conversations about the migration project to ensure that the new ERP or CRM still provides the same level of data coverage.
In the end, the impact of migrating data sources can be challenging to mitigate (even if your data model is well designed). There is never a 100% guarantee that the new data source has all the right fields and supports all the same business logic.
Mergers And Acquisitions
One possible project I was going to take on involved a company that had recently merged 6-7 other companies into one, all running different ERP and finance solutions. Now all of the current ETL, data pipelines, and data warehouses that these various companies had developed need to be centralized.
In the long term, the company will centralize all of its business processes into a single set of solutions. However, until then, the company still needs to report on all that is occurring in its multiple lines of business. Thus, there is a moment of chaos. A moment when a data team will have to force all of the various sources into some congruent reporting layer.
Overall, when companies merge it is best to make sure there is a lot of communication between all the various data teams. The sooner the better. Discussions on what to do with all the various data sources and how to combine them need to happen quickly to reduce friction and distrust.
Lack Of Metrics Definition Process
The problem with defining metrics like churn is that they can mean something different depending on the business you're in or even how the question itself is framed. Perhaps one customer makes a large purchase once every 6 months while another makes a purchase every month. How will you define churn?
If you leave the process of defining metrics ambiguous, you will receive ambiguous results.
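To make the ambiguity concrete, here is a small sketch comparing two plausible churn definitions on the two kinds of customers described above. Both definitions, the 90-day window, the 1.5x tolerance, and the purchase dates are assumptions chosen for illustration, not recommended values.

```python
from datetime import date
from typing import List


def churned_fixed_window(
    purchases: List[date], as_of: date, window_days: int = 90
) -> bool:
    """Definition A: churned if no purchase within a fixed window."""
    return (as_of - max(purchases)).days > window_days


def churned_relative(
    purchases: List[date], as_of: date, tolerance: float = 1.5
) -> bool:
    """Definition B: churned if the customer is overdue relative to
    their own average gap between purchases."""
    ordered = sorted(purchases)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two purchases to measure a gap")
    typical_gap = sum(gaps) / len(gaps)
    return (as_of - ordered[-1]).days > tolerance * typical_gap


# Illustrative purchase histories (made-up dates).
semiannual_buyer = [date(2021, 1, 15), date(2021, 7, 15), date(2022, 1, 15)]
monthly_buyer = [date(2021, 11, 1), date(2021, 12, 1), date(2022, 1, 1)]
as_of = date(2022, 5, 1)

for label, history in [("semiannual", semiannual_buyer), ("monthly", monthly_buyer)]:
    print(
        label,
        churned_fixed_window(history, as_of),
        churned_relative(history, as_of),
    )
```

With these sample dates the monthly buyer counts as churned under both definitions, while the semiannual buyer is churned under the fixed window but still active under the relative one. Neither answer is wrong; they simply answer different questions, which is why the definition needs to be agreed on up front.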
When defining metrics it is important to make sure you define the goal of the metric, have clear buy-in from the stakeholders who will care about said metric, and document in plain English what the logic is and what it represents.
All of this can seem like unnecessary steps, but having even a baseline process ensures that how metrics are created and managed is standardized.
Will We Ever Be Able To Answer Basic Questions?
Our struggle to answer basic questions in business is driven by many different factors, many of which have less to do with the size of data and more to do with all the components that change over time, whether that be employees quitting, ERPs being swapped out, or another migration project to a new data warehouse solution.
All of these constantly moving pieces force most data teams into a 2 steps forward, 1.75 steps back approach to their work. There might be progress, but it's very slow and can easily be reversed to a state where even answering basic questions becomes challenging again.
It might seem like this post was filled with lamentations. But honestly, I have found in the last few years teams are capable of getting back to a state where they can answer basic questions faster and faster as tooling and collaboration improve.
How have you ensured that your data team keeps moving forward?
Video Of The Week: Write Better SQL
SQL is one of the most popular tools when it comes to working with data.
Whether you're an analyst, a data engineer, or a data scientist, we all use it heavily to work with data.
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 32k and my email newsletter has grown from 2k to well over 9k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will in turn play a role in what channels, categories, and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily and that’s just Medium! I have spent a lot of time sifting through some of these articles as well as TechCrunch and company tech blogs and wanted to share some of my favorites!
Dynamic Kubernetes Cluster Scaling at Airbnb
An important part of running Airbnb’s infrastructure is ensuring our cloud spending automatically scales with demand, both up and down. Our traffic fluctuates heavily every day, and our cloud footprint should scale dynamically to support this.
To support this scaling, Airbnb utilizes Kubernetes, an open source container orchestration system. We also utilize OneTouch, a service configuration interface built on top of Kubernetes, and is described in more detail in a previous post.
In this post, we’ll talk about how we dynamically size our clusters using the Kubernetes Cluster Autoscaler, and highlight functionality we’ve contributed to the sig-autoscaling community. These improvements add customizability and flexibility to meet Airbnb’s unique business requirements.
Introducing Unistore, Snowflake’s New Workload for Transactional and Analytical Data
Snowflake has once again transformed data management and data analytics with our newest workload—Unistore. For decades, transactional and analytical data have remained separate, significantly limiting how fast organizations could evolve their businesses. With Unistore, organizations can use a single, unified data set to develop and deploy applications, and analyze both transactional and analytical data together in near-real time.
We’ve seen the impact that removing data silos can have—be it bringing faster analytics to large-scale data, or changing the world of data collaboration. The new use cases Unistore creates will define what it means to be data-driven, now and in the future, whether it be streamlining your business, knowing and serving your customers, or revealing previously unforeseen market opportunities.
Full-Spectrum ML Model Monitoring at Lyft
Machine Learning models at Lyft make millions of high stakes decisions per day from physical safety classification to fraud detection to real-time price optimization. Since these ML model based actions impact the real world experiences of our riders and drivers as well as Lyft’s top and bottom line, it is critical to prevent models from degrading in performance and alert on malfunctions.
However, identifying and preventing model problems is hard. Unlike problems in deterministic systems whose errors are easier to spot, models’ performance tends to gradually decrease, which is more difficult to detect. Model problems stem from diverse root-causes including…
End Of Day 46
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, the modern data stack, tech, and start-ups.
If you want to learn more, then sign up today. Feel free to sign up for no cost to keep getting these newsletters.