Modernizing Your Data Architecture With The "Modern Data Stack"

Community Update #14

Integrating data into your strategy is proving to be a differentiator for businesses of all sizes.

The cliche term “Data-Driven” is for more than just a billion-dollar tech company.

Companies like DiscoverOrg and MVF are using data to help drive their decisions and create better products.

Even smaller companies are finding savings and new revenue opportunities left and right thanks to data.

However, this is all easier said than done.

Just pulling data from all your different data sources isn’t always sufficient. There are a lot of problems that can come up with developing your data strategy and products.

In this article, I will outline some of the problems you will run into using data including increasing data size, having consistent data and definitions, and reducing the time it takes to get data from third-party systems to data warehouses.

While at the same time providing some solutions.

Too Much Data Leading To Poor Performance

Big Data.

It solves all your bad algorithm problems, right?

Sort of.

Big Data can cause a host of new problems.

In particular, performance.

Big Data, means if you don’t have systems that can handle the data, then you will quickly run into problems.

Even the largest corporations run into this. Data keeps growing and in turn slowing down dashboards, models, and reports. Waiting 2 minutes for a Tableau dashboard isn’t an option.

No executive, director, or manager wants to wait that long.

So although Big Data offers a lot of benefits as far as possible insights go.

Big Data also quickly becomes a burden(and I haven’t even referenced the problem with pruning unnecessary data).

So how do we improve the performance of big data systems?

Solution #1: Pay For More Compute

One way to solve the problem is just to spend more money. Buy more compute on your clusters or get bigger machines.

But there is always a limit.

I have worked with plenty of clients who quickly calculated that to improve their performance it would cost way too much if they went the route of more compute.

So this can be an easy step, but arguably not the best solution.

Sometimes, it’s less about compute and more about design as well as the underlying system.

Solution #2: Migrate To A Cloud Data Warehouse Designed For Speed

Through a combination of technical debt and just time, systems begin to slow.

There are several ways to improve without just buying bigger machines.

For example, if your team is using databases like Postgres or standard SQL Server for your data warehouse, then it might be time to migrate to a cloud data warehouse that was designed for analytical queries.

You could look into cloud data warehouses like Azure Synapse, Redshift, or BigQuery. These can help improve performance depending on how you design your data warehouse.

Solution #3: Improve Your Data Warehouse Design

If you’re already in a cloud data warehouse, then you can look into other possible solutions like a better design, aggregate tables, and indexes.

These are all solutions that would require looking at your design, looking at bottlenecks, and assessing the best solution.

Improving your overall design is a much hard topic to discuss. I would be happy to discuss this on a free consultation call or our open office hours that I hold every week.

You can also look at some of these articles on scaling as there are a lot of ways to approach the problem.

How To Improve Your Apps Performance Before It Slows Down

Improving Data Warehouse Performance


Managing Complex Business Decisions In Excel

Excel and spreadsheets continue to drive billion-dollar decisions in companies across the globe. This reliance on Excel has also led to millions and billion-dollar mistakes by even the smartest companies.

For example, in 2008 Barclays agreed to purchase Lehman Brothers, except spreadsheet errors led them to eat losses on contracts they did not intend to buy. The detailed spreadsheet of Lehman assets contained approximately 1,000 rows that needed to be converted into a PDF. Except, the original Excel version had hidden rows with 179 items that Barclays did not want. The notes that they did not want those contracts were not transferred to the PDF, but the hidden rows were. As a result, they had to buy 179 contracts that they did not want.

And in 2012 JP Morgan lost nearly 6 billion dollars due to what came down to mostly Excel errors.

Excel is a very versatile data tool that can help teams manage a lot of workflows. However, they are also very error-prone because of complex design, human error, and how Excel operates in general.

To avoid these errors your team has a few strategies you can put into place.

Solution #1 — Treat Excel Like Engineers Treat Code

If your Excel is being used to make large decisions, then you should treat it the same way engineers treat code.

This means there should be Excel reviews and test cases. It may seem like overkill, but Excel is very much like code. Some people even consider it a 4th generation coding language. That means it is prone to errors caused by logic, the process, and the inputs provided.

In turn, it should be treated as such.

Don’t just trust that an analyst, regardless of how smart they are will make a perfect Excel sheet. Brilliant programmers make bugs, and so do brilliant analysts.

Thus a logic review, test cases, and sanity checks should be put into place to reduce these kinds of errors. This all might seem unnecessary until your company loses massive amounts of money due to bad information.

Solution #2 -Automate Excel Using Python And SQL

Automate and develop clean data workflows for business processes that are well defined and can be converted into SQL and code.

A lot of reporting requires copy-pasting data from a report or database table. Instead of copy-pasting, there are methods your team can utilize to develop solutions that automatically provide the data outputs.

This can be done using a combination of SQL, code, and Spreadsheets with limited manipulation. Code is still error-prone. However, it can often be written in ways that limit the number of errors and inputs. This is not the case with Excel. Excel’s flexibility is what makes it error-prone.

Which solution is best depends on the complexity of the process, its repetitiveness, and how large of decisions are being made off of the data solution.


Getting Your Data Into Your Data Warehouses Fast

I have now talked to data scientists and analysts at companies across the US and they all have one major complaint.

They can’t get to their data fast enough.

This is often because there aren’t enough data engineers with free time to pull in every new data set.

Drastically slowing down the ability for data scientists and analysts to answer questions that could be saving the business thousands if not millions of dollars.

There is a constant balance between creating well-governed data systems and providing data quickly so business executives can make decisions.

But by the time the data is added, the a

Solution 1: Use An ELT Instead of an ETL For Ad-hoc Work

The truth is, at some point, with the speed modern companies need to move. Sometimes working with data that’s good enough can be better than working with perfect data.

Just to clarify. I believe that data that goes into your central data warehouse should be as accurate and dependable as possible.

However, thanks to alternative central data storage systems. Data scientists and analysts can start to get access to new data sets faster without having to wait for data to be 100% set up. With the caveat that the data could be funky.

This is where ELTs come in.

For those unfamiliar with ELTs and ETLs. These are methods of extracting, transforming, and loading data into data warehouses. Notice the first letter of that process makes up ETL.

I have gone over ETLs in the past.

The problem is, ETLs are slow to develop because the transform is often code heavy.

There are pros and cons to that, but I will avoid discussing that now.

ELTs move the transform over to the end of the process which allows data analysts and scientists to start to work with the data before its fully processed. This does pose some risk as far as data governance goes.

However, I believe ELTs play a role when it comes to ad-hoc analysis as well as attempting to figure out which data sets to model for your core data layer.

ELTs have lots of solutions and tools you can look at some of them below.


Having Consistent Data Values And Definitions

Not having consistent data definitions and values across teams can cause major issues.

In many ways, this problem somewhat conflicts with the solution I provided above. But from this perspective, I am purely focusing more on the core data models of a business.

So I believe this problem is slightly different.

For those who are unfamiliar with this problem.

Let me provide a simple example.

Let’s say you are creating a metric that is looking at multi-day events vs. single day events.

What defines a multi-day event.

Is it an event that lasts more than 24 hours or an event that spans more than 2 days or perhaps somewhere in between?

Pre-defining key concepts that your business uses daily is important because it is likely every business team will be using that term, KPI, and definition.

Meaning that if there isn’t a consistent definition across teams, you will likely see inconsistent reporting.

One team will say there were 100 multi-day trips while the other will say there were 90 multi-day trips.

I have seen this cause chaos in meetings. Suddenly, instead of focusing on the actual context and impact of some form of strategic change. Managers and directors are trying to discover the reason for the 10 trip difference.

Completely derailing any form of actual strategic purpose of the meeting and turning it into a waste of time as logic and people attempt to match numbers.

Similarly, another problem that occurs is with all the various integrating systems like Workday and Salesforce. Various fields can become unsynced.

For example, a company’s Salesforce instance might pull the job position once a week for employees from Workday. That great in terms of reducing manual intervention.

However, it poses a risk for reporting and having accurate information.

If a data analyst pulls from salesforce and assumes the data is always up to date. They could be wrong.

There are a lot of risks that occur by having a lack of clear definitions and clear sources of truth.

Solution #1 Implement a data governance strategy:

One great solution is setting up a data governance process.

Data governance is far from sexy.

It’s not data science or machine learning.

However, data governance is the process of managing the availability, usability, integrity, and security of the data in enterprise systems, based on internal data standards and policies that also control data usage. Effective data governance ensures that data is consistent and trustworthy and doesn’t get misused.

That last part in particular. Making sure the data is consistent and trustworthy.

Data governance, as its name, suggests, unavoidably adds more process and bureaucracy to moving data around.

Companies will set up data governance committees that work to ensure that the data that exists and is reported on is consistently the same.

Data in modern companies are often very integrated and automatically populates various third-party sources. I gave the example above of workday and Salesforce integrations.

Data governance helps define which fields should come from which sources so you don’t pull the same data from different sources causing a risk of data not matching.

Data governance is a human process.

Solution #2 Pick A Data Lineage Tool:

Other more automated processes can help your teams track data sources and how data goes from point a to point b.

For example, tools like TreeSchemaOctopai and Kylo automatically scrape your team’s various data sources to help better track metadata like who owns which table, what is the schema of your data and where did the data come from.

Not only do these tools help track where the data comes from, but it also helps track changes in the data and bad data practices.

For example, TreeSchema offers some features like alerting you when new data sources or fields are added. In a normal company, a new data object would be created but might not have any documentation for other users to know what that data is. If it does have documentation it might be in some shared folder or SharePoint site. Instead, with TreeSchema, you centralize your data documentation and get updates when there are holes in that documentation.

Having some method to track all your meta-data can help simplify and eliminate confusion about data as well as creating a more reliable core data layer.


Lack Of A Clear Data Strategy

One of the biggest challenges that most companies face is a lack of clear direction.

There is so much data to use, analyze, build data products off of and integrate that it makes it very difficult to know where to start.

Setting up a clear data strategy is generally step one.

What this means is looking at what your overall business goals are, and then seeing where you can align your data goals.

Not the other way around.

I have worked with clients whose goals were to integrate AI into their services. But they didn’t really think through the role AI would play in their overall business strategy.

Solution 1: Start With Your Business Goals First

Before you spend thousands if not hundreds of thousands of dollars on some new machine learning model or data warehouse. Do make sure that there are some benefits to your business. There needs to be alignment.

So first plan your business goals and then see what data you have that can help those business goals.

I wrote a post a while back to help you plan your data business strategy. Click here to read more about data analytics strategy.

Conclusion

Using data to make better decisions provides companies a competitive advantage. However, this depends on the quality of data and the robustness of data processes set up.

Simply creating dashboards, data warehouses, and machine learning models is not sufficient to make data-driven decisions. Teams need to consider their data life-cycles and the processes used to manage each step. This means creating test cases, clear goals and processes can help improve your team’s performance and strategy. No one wants to get bogged down with too many processes and bureaucracy but having no form of plan or strategy for your team’s data life-cycle will also fail.

Ask A Data Consultant - Office Hours

Every newsletter I open up a day or two with a few slots for open office hours where my readers can sign up and you can ask me questions. I got to answer a lot of great questions so far and hopefully, they helped provide a lot of insights for those who signed up.

Sign Up Below:

Next Open Office Hours

Sign Up For My Next Office Hours on August the 5th at 8 AM - 10 AM PT or between 5 PM - 7 PM PT


Thanks To The SDG Community

I started writing this weekly update more seriously about 6-7 weeks ago. Since then I have gained hundreds of new subscribers as well as 8 supporters!

And all I can say is, Thank You!

You guys are keeping me motivated.

Every read, comment, like and financial subscription is amazing and I really appreciate all you.

If you want to help support this community consider clicking the link below.

Support The Community Today


Video Of The Week: Software Engineer Vs. Data Engineer - Is There A Difference?

Software engineers vs data engineers What is the difference between a data engineer and a software engineer?

This answer can vary depending on which company and or team you are working on. In some cases data engineers are more like software engineers that focus on data. In other cases data engineers are more like BI developers.

This wide range makes it difficult for data engineers as well as companies to really articulate what they want and their skill sets. Hopefully this video helps discuss some of those differences.


Articles Worth Reading

There are 20,000 new articles posted on Medium daily and that’s just Medium! I have spent a lot of time sifting through some of these articles as well as TechCrunch and companies tech blog and wanted to share some of my favorites!

Start small when building a modern data stack

Building a modern data stack doesn’t require a lot of up-front effort. In fact, one of the biggest mistakes you can make when transitioning to a modern data stack is trying to do it all at once as a Goliath task.

There are a lot of our favorite folks in the data space that agree with this take (in fact, a bunch of them recently talked about building the modern data stack in a recent Mixpanel’s recent webinar).

By starting small, you lean into one of the modern data stack’s main strengths: its modularity. In the process, you’ll steadily construct a data infrastructure that’s efficient and customized to your business’s exact needs, all without stressful workflow disruptions and difficult buy-in conversations. So, start small. The journey to a modern data stack (and a leap into better data) starts with one small optimization.

Read More Here

Orchestrating ELT in Airflow: Announcing the Fivetran Airflow Provider

Astronomer helps organizations adopt Apache Airflow. As part of these efforts, they have just released the Astronomer Registry, a hub for Airflow providers designed to bridge the gap between the Airflow community and the broader data ecosystem. If you are familiar with dbt’s hub for community contributions and Fivetran's dbt packages, this registry may look similar.

Fivetran's developer relations team worked with Astronomer as partners on the launch, developing a provider that allows users to start and monitor Fivetran data syncs from within Airflow. The provider only needs an API key and API secret and the connector IDs configured to combine the benefits of automated data ingestion from Fivetran and data pipeline orchestration from Airflow. The three main benefits our users are experiencing for this provider are the following:

Read More Here

The Future of the Modern Data Stack

The Modern Data Stack commonly refers to a collection of technologies that comprise a cloud-native data platform, generally leveraged to reduce the complexity in running a traditional data platform. The individual components are not fixed, but they typically include: 

Read More Here

Types of Product Usage Segmentation

How, and why, do users engage with your product? What types of users engage with your product? Do these different “types” have different needs or expectations?

Users will teach us more about our own product than any amount of self-study possibly could. Only users, through their interactions with the product, can bring it to life.

Product usage segmentation is about prizing these interactions and using data to get as close to them as is possible, so that we might learn about our product, its users, and the bi-directional relationship between the two. The more granular the product analytics a team can gather, the better equipped they’ll be to make decisions about their product roadmap.

This article will explore “the what, the why, and the how” of employing product usage segmentation to understand your users and make data-informed product decisions.

Read More Here


End Of Day 14

Modernizing your data architecture can be a game changer. You can reduce the number of engineers managing your data systems and improve its performance.

But at the same time, if your companies not facing these challenges, then don’t fix what isn’t broken.

Thanks for reading!

Loading more posts…