Modern Streaming Data Analytics And The US Air Force's 100 Million Dollar Bet
And More Office Hours With A Data Science Consultant
The CTO’s Focus Of The Week
The focus for this week is streaming data solutions and analytics.
Streaming data is far from new. Kafka and other solutions have existed for well over a decade. However, most of these were complex to implement and weren’t always approachable.
With that in mind, several companies have tried to make data streaming easier.
We will be talking about a few of these companies later on, but we wanted to mention tools like Vectorized and Materialize, as both have recently secured new rounds of funding. Both of these streaming data solutions aim to make streaming data accessible to a larger audience.
Often, I tell clients that they will likely not need streaming solutions and that it would likely be cheaper and easier to maintain batch ETLs. But the barrier to building streaming and real-time analytical tools is lowering every day.
As Narayan, co-founder and CEO of Materialize, put it: “Our goal is really to help any business to understand streaming data and build intelligent applications without using or needing any specialized skills. Fundamentally what that means is that you’re going to have to go to businesses using the technologies and tools that they understand, which is standard SQL.” Now, I would disagree with the “specialized skills” statement.
I find that most companies trying to make tooling to simplify development, like drag-and-drop dashboards and low-code/no-code platforms, miss the point.
As a consultant, I have come in on many projects because a client purchased a tool that they had no idea how to use. Although the tools often make development easy, you still generally need a technical expert.
But that’s my perspective.
Now, if you need more real-life proof of organizations investing in streaming data, look no further than the US Air Force.
The US Air Force just awarded a five-year contract to Kinetica with a budget cap of 100 million dollars to create a streaming data warehouse. The goal of this project is to deliver said streaming data warehouse for the NORAD and USNORTHCOM Pathfinder program that will tie independent systems together across air, land, sea, space, and cyber domains, creating a fused operational picture.
It’s boasted that the Kinetica Data Warehouse can manage trillions of rows and run models to analyze risk in seconds or minutes.
Now, whether this project succeeds or not is yet to be seen.
Streaming is more than hype. Using streaming with IoT device data or just data that needs to be processed and analyzed as soon as it enters your team's applications can make a lot of sense. Just make sure you are using the right tool for the right job.
Ask A Data Consultant - Office Hours
With every newsletter, I open up a day or two with a few slots for office hours where readers can sign up and ask me questions. I have gotten to answer a lot of great questions so far, and hopefully the answers provided some useful insights for those who signed up.
Sign Up Below:
Next Open Office Hours
I have also opened up the 24th to see if there are maybe some more general times people would like to talk.
Articles Worth Reading
There are tons of great articles on data science and engineering. The section below has a combination of articles we have read as well as written that cover some current topics in the data and tech space.
State of Machine Learning and Data Science 2020 - Kaggle’s Yearly Review
For the fourth year, Kaggle surveyed its community of data enthusiasts to share trends within a quickly growing field. Based on responses from 20,036 Kaggle members, we’ve created this report focused on the 13% (2,675 respondents) who are currently employed as data scientists.
If you’re curious what to expect and learn, here is just one of well over two dozen different charts and graphs about data scientists’ salaries, education, and so much more.
Below, you can see popular methods and algorithms used by a data scientist.
5 Data Analytics Challenges Companies Face in 2021
Integrating data into strategy is proving to be a differentiator for businesses of all sizes.
Being “data-driven” may be a cliche, but it applies to more than just billion-dollar tech companies.
Even smaller companies are finding savings and new revenue opportunities left and right thanks to data.
However, this is all easier said than done.
Just pulling data from all your different data sources isn’t always sufficient. There are a lot of problems that can come up with developing your data strategy and products.
In this article, I will outline some of the problems you will run into when using data, including increasing data size, keeping data and definitions consistent, and reducing the time it takes to get data from third-party systems into data warehouses, while also providing some solutions.
Decrypted: A hacker attempted to poison Florida town’s water supply
We take much of our infrastructure for granted, especially in the USA.
This week showed how easily that trust and reliance on even necessities like water can be threatened.
If you haven’t already heard, in the small town of Oldsmar, Florida, a hacker broke into the drinking water system and tried to poison the supply. The hacker gained access to a computer at the water facility used for running the remote control software TeamViewer, according to Reuters, and jacked up the levels of sodium hydroxide, aka lye, which would have made the water highly toxic to drink.
This attack, coming on the heels of the SolarWinds attack, is a somber reminder that we are fighting an ever more difficult battle when it comes to cybersecurity.
Companies We Are Watching
With our focus on streaming analytics, we wanted to look at Materialize and Vectorized and see what these tools are offering that might be interesting for our readers to learn about.
Materialize is a SQL streaming database startup built on top of the open-source Timely Dataflow project.
It allows users to ask questions of living, streaming data, connecting directly to existing event streaming infrastructure, like Kafka, and to client applications.
Engineers can interact with Materialize using a standard PostgreSQL interface, enabling plug-and-play integration of existing tooling.
When SQL queries are run, they are recast as dataflows. This allows users to perform interactive data exploration and data warehouse-like analytics against live relational data, which is typically not possible.
Under the hood, Materialize uses Timely Dataflow (TDF) as the stream processing engine. This allows Materialize to take advantage of the distributed data-parallel compute engine. The great thing about using TDF is that it has been in open source development since 2014 and has since been battle-tested in production at large Fortune 1000-scale companies.
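To make the idea of a query that stays up to date as data arrives a little more concrete, here is a minimal Python sketch. It is a toy illustration only, not how Materialize or Timely Dataflow is actually implemented. The class name IncrementalSumView, the apply method, and the (key, amount, diff) event shape are all hypothetical names made up for this example. It maintains the result of something like SELECT key, SUM(amount) ... GROUP BY key by applying each change as it arrives instead of rescanning every row.

```python
from collections import defaultdict


class IncrementalSumView:
    """Toy stand-in for a continuously maintained GROUP BY / SUM result.

    Hypothetical example class; not part of any real library.
    """

    def __init__(self):
        self._totals = defaultdict(int)  # running SUM(amount) per key
        self._counts = defaultdict(int)  # number of live rows per key

    def apply(self, key, amount, diff=1):
        # diff=+1 means a row was inserted, diff=-1 means a row was retracted.
        self._totals[key] += diff * amount
        self._counts[key] += diff
        if self._counts[key] == 0:
            # No rows remain for this key, so it drops out of the result.
            del self._totals[key]
            del self._counts[key]

    def result(self):
        # Current answer to the query, available without a rescan.
        return dict(self._totals)


def batch_sum(rows):
    """Batch equivalent: recompute the whole aggregate from scratch."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


view = IncrementalSumView()
events = [
    ("sensor_a", 5, 1),
    ("sensor_b", 3, 1),
    ("sensor_a", 2, 1),
    ("sensor_b", 3, -1),  # the earlier sensor_b row is retracted
]
for key, amount, diff in events:
    view.apply(key, amount, diff)

print(view.result())
```

The contrast with batch_sum is the point: a batch ETL recomputes from all rows on a schedule, while the incremental version does a small amount of work per event and always has a current answer.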
Vectorized is still on the newer side of streaming tools, having just raised 15.5 million dollars in funding in January 2021.
The startup’s entry into the crowded data management market is an open-source stream processing platform dubbed Redpanda. It aims to provide an alternative to the industry-standard Apache Kafka engine.
If you want a deeper explanation, you can hear from the founder of Vectorized, Alexander Gallego, as he discusses it on the Data Engineering Podcast.
In the episode, he discusses how Redpanda was engineered as a drop-in replacement for Kafka. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces.
It’s a great listen if you want to hear about the driving factors for this technology.
The End Of Day Three
With that, we will wrap up our 3rd newsletter. We hope you learned something about what is going on in the data space. It can be hard to keep up with all the new terms, words, start-ups, and just general hype.
If you have any questions, please comment below or sign up for our office hours!