It can be easy to take certain tools for granted when you work for companies with mature infrastructure. One of my favorite tools at Facebook was iData.
iData was Facebook’s data discoverability tool. It provided a lot of functionality that I have started to miss. This included the baseline functions you would expect, such as the ability to find tables, trace lineage, and track down the owners of those tables.
But there were also other beneficial features like cost tracking, data quality assessments, and table certification. All of these features made it easy for a new data engineer to quickly orient themselves as they started on new projects.
My Favorite iData Feature
My favorite feature was being able to see how other users were using the data at the query level. This provided a lot more context than commented fields alone. ERDs and data lineage are great, but seeing exactly how other users were querying the data made it much easier to understand (and those users were great people to ping if you had questions).
It was so easy to quickly understand how the data was already being used. This provided several benefits, including:
Reducing the duplication of work
Providing context on how data could join together (even across multiple data sources)
Showing you who to ask about the data. The owner is a great place to start, but owners sometimes move away from datasets over time
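To make the idea of query-level context a little more concrete, here is a minimal sketch in Python. It is not how iData worked internally; I am assuming a hypothetical query log of (user, SQL) pairs and using a deliberately naive regex to pull table names out of FROM and JOIN clauses. The log contents, table names, and function names are all made up for illustration.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical query log: (user, SQL text) pairs. Invented data, not from iData.
QUERY_LOG = [
    ("alice", "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id"),
    ("bob", "SELECT count(*) FROM orders JOIN customers ON orders.customer_id = customers.id"),
    ("carol", "SELECT * FROM orders o JOIN shipments s ON o.id = s.order_id"),
    ("alice", "SELECT id FROM customers"),
]

# Naive assumption: a table name is the identifier right after FROM or JOIN.
# A real tool would use a proper SQL parser to handle subqueries, CTEs, etc.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the sorted, de-duplicated table names referenced in one query."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(sql)})


def summarize(query_log):
    """Count who queries each table and which tables appear together."""
    users_by_table = defaultdict(Counter)
    joined_pairs = Counter()
    for user, sql in query_log:
        tables = tables_in(sql)
        for table in tables:
            users_by_table[table][user] += 1
        # Tables used in the same query are treated as "joined together".
        for pair in combinations(tables, 2):
            joined_pairs[pair] += 1
    return users_by_table, joined_pairs


users_by_table, joined_pairs = summarize(QUERY_LOG)

# Who to ping about a table: its most frequent querier.
print(users_by_table["customers"].most_common(1))
# Which tables are most commonly used together.
print(joined_pairs.most_common())
```

Even a rough summary like this covers the benefits in the list: you can see which tables already get joined together, and the most frequent querier of a table is often a better person to ask than an owner who has moved on.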
After leaving the company formerly known as Facebook, I felt like I kept stumbling on a new data catalog or discoverability tool every week. At this point, I am sure I have come across at least 3-5 dozen data discovery tools, all of which add their own flair to helping teams manage their metadata.
With so many data discoverability tools out there, I wanted to take a moment and catalog all the data catalogs. Below you will find just some of the ever-growing list of data discovery tools.
Catalog Of Data Catalogs
A
...Honestly, probably a half dozen other “A” data catalogs.
B
C
D
E
G
I
Informatica Enterprise Data Catalog
L
M
O
P
R
S
T
Z
Do You Find Data Discovery Tools Helpful?
Data discovery tools can help reduce onboarding time for new hires as well as improve data workflows. But it is often difficult to gain buy-in for them, and there are so many options that many companies struggle to figure out which solution best fits their needs.
For me, having an easy-to-search tool at a company that probably had 30,000+ tables was a must. I did spend a lot of time in the UI. Sometimes it was because I needed to update some of the information so my table could be certified. Other times it was because I found a table that looked like it had data I wanted to pull in, so I wanted to see if I could avoid building another pipeline.
Overall, as teams grow, data discovery tools become a must. But I would love to hear your thoughts.
Do you find data discovery tools helpful?
Which ones do you enjoy?
Video Of The Week: Building Data Infrastructure With Neelesh Salian
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 39k and my email newsletter has grown from 2k to well over 16k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server. Currently, this is mostly a soft opening.
I want to see what people end up using this server for. How it is used will in turn play a role in what channels, categories, and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Building a Real-Time User-Facing Dashboard with Apache Pinot and the Ecosystem
When we hear the term “decision-maker,” we often picture a “C-suite” or an “executive” looking at a computer screen full of numbers and charts, leveraging analytics to make data-driven business decisions. The question is should CEOs be the only ones who make decisions? Can it be a regular user like you and me?
With the abundance of technology and connectivity, many online platforms enable their users to get on-boarded, create digital content or provide services. These users need access to timely and actionable analytics to measure their content performance and tweak them to improve customer service.
For example,
Medium offers content statistics to their writers to tweak their content.
YouTube offers analytics to its video content creators, enabling them to tweak their content for maximum engagement.
UberEats Restaurant Manager provides real-time metrics to restaurant owners about their performance to improve menu items, promotions, etc.
Rethink Your Data Architecture With Data Mesh And Event Streams
According to a Gartner prediction, only 20% of data analytics projects will deliver business outcomes. Indeed, given that the current data architectures are not well equipped to handle data’s ubiquitous and increasingly complex interconnected nature, this is not surprising. So, in a bid to address this issue, the question on every company’s lips remains — how can we properly build our data architecture to maximize data efficiency for the growing complexity of data and its use cases?
First defined in 2018 by Zhamak Dehghani, Head of Emerging Technologies at Thoughtworks, the data mesh concept is a new approach to enterprise data architecture that aims to address the pitfalls of the traditional data platforms. Organizations seeking a data architecture to meet their ever-changing data use cases should consider the data mesh architecture to power their analytics and business workloads.
How To Start Your Next Data Engineering Project
Many programmers who are just starting out struggle with starting new data engineering projects. In our recent poll on YouTube, most viewers admitted that they have the most difficulty with even starting a data engineering project. The most common reasons noted in the poll were:
Finding the right data sets for my project
Which tools should you use?
What do I do with the data once I have it?
Let’s talk about each of these points, starting with the array of tools you have at your disposal.
End Of Day 50
Thanks for checking out our community. We put out 3-4 newsletters a week discussing data, the modern data infrastructure, tech, and start-ups.
If you want to learn more, sign up today. It costs nothing to sign up and keep getting these newsletters.
DataHub is quite great but I find dbt's approach to generating documentation out of your code to be the most elegant
What about https://www.tableau.com/products/add-ons/catalog? Salesforce seems to invest a lot into this plugin.