From Data Modeling to MLOps For LLMs: Lessons from the DEML Summit
With Links To All The Talks!
A few weeks back I co-hosted the very first Data Engineering And Machine Learning Summit.
We had well over 3,000 people sign up and over a dozen speakers share their experiences and thoughts on the data industry. Some talked about data modeling, others about data science and oil drilling, and still others about LLMs (of course!).
If you happened to miss it, I wanted to share a few of the takeaways along with links to the talks so you can revisit them (all the links are below)!
With that here are some of the key takeaways from 5 of the DEML talks.
1. Denormalization Can’t Happen Without Normalization
This talk revolved around taking a step back and asking: data modeling, WTF is it? The speaker walked through the concepts of normalization and denormalization, and one line stood out: in order to denormalize, you first need to normalize your data. Even if you don’t plan to keep a normalized data model for your operational database, normalizing your data gives you the ability to understand what it actually represents: the relationships, business interactions, transactions, events, all of it.
Getting to third normal form can provide some mental clarity about what is actually going on. But as he brought up, it’s very common for us to just dive in and create tables as we need them, leading to JBOT, or “Just a Bunch of Tables”. It was a great refresher on the basics of data modeling.
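To make that concrete, here is a minimal sketch of the normalize-then-denormalize idea using SQLite from Python. The customers/orders schema is my own illustrative example, not one from the talk: the facts live once in normalized tables, and the wide, query-friendly table is derived from them, so the relationships stay explicit even after you flatten them.

```python
import sqlite3

# Illustrative schema only -- not from the talk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (roughly 3NF): each fact lives in exactly one place.
cur.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL,
    ordered_at  TEXT NOT NULL
);
""")
cur.execute("INSERT INTO customers VALUES (1, 'Ada', 'Denver')")
cur.execute("INSERT INTO orders VALUES (100, 1, 42.50, '2023-09-01')")

# Denormalized: a wide table derived FROM the normalized model, so you still
# know exactly which relationships were flattened into it.
cur.execute("""
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.amount, o.ordered_at, c.name AS customer_name, c.city
FROM orders o
JOIN customers c USING (customer_id)
""")
print(cur.execute("SELECT * FROM orders_denorm").fetchall())
conn.close()
```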
2. Tangible Data Science Projects
There are a lot of great examples of data science and analytics projects out there. But there is something different about hearing about a project that is tangible, one that isn’t focused on product analysis or some form of digital marketing.
Instead, it’s focused on real things. That’s what I really enjoyed about Jessica Iriarte’s talk. She walked through the process of developing models and automated systems to improve drilling for oil, which is a highly complex practice. It was great to hear about something so tangible, and it was a fascinating topic to learn about.
3. Build ML Systems That Match Your Data Team’s Size
One of the issues that constantly comes up is right-sizing your data infrastructure. A lot of data teams and companies read articles from Netflix, Facebook, or some other tech-focused company and think their data stack should look the same. But that’s not always the case.
Finding a data stack that matches your company’s size, needs, and focus is important. Not all companies have thousands of engineers who can focus on building new infrastructure and, more importantly, maintaining it. This is one of the many points that Mikiko Bazeley covered in her talk MLOps For LLMs. She points out that building MLOps, whether for traditional ML or for LLMs, is a process, and it takes time. She does a great job walking us through that process in her video.
4. Business Patterns
We talk a lot about design patterns in the tech world. Whether you’re referring to programming, data, or general systems. We follow certain design patterns because they solve certain problems.
However, in Shachar Meir’s talk (How To Boost Your Impact As A Data Person), he brought up business patterns that are important to understand as a data person, whether you’re a data engineer, data scientist, or analyst. Framing these terms as business patterns, I believe, provides the ability to abstract concepts like engagement, anomaly detection, and ROI to fit a broader set of problems, regardless of the industry you’re working in. These patterns can be applied across domains, and you can fit similar models on top of them. Yes, you’ll still need to spend time defining what engagement actually means for posts vs. videos vs. shorts, but once you have that definition, you can start to apply other models on top of it.
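As a rough illustration of that idea (the engagement formulas and threshold below are my own made-up examples, not anything from Shachar’s talk), the definition of engagement can differ per content type while the pattern layered on top of it, here a simple z-score anomaly check, stays the same:

```python
from statistics import mean, pstdev

# Illustrative engagement definitions -- invented for this sketch.
def engagement_post(likes: int, comments: int, impressions: int) -> float:
    # Posts: weight comments more heavily than likes, normalize by impressions.
    return (likes + 2 * comments) / max(impressions, 1)

def engagement_video(watch_seconds: float, duration: float, shares: int) -> float:
    # Videos: completion rate plus a small bonus per share.
    return watch_seconds / max(duration, 1) + 0.1 * shares

def zscore_anomalies(series: list[float], threshold: float = 1.0) -> list[int]:
    # The reusable "pattern": flag periods whose engagement deviates from the mean.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]

# The same anomaly-detection pattern sits on top of either engagement definition.
daily_post_engagement = [
    engagement_post(10, 2, 500),
    engagement_post(12, 3, 520),
    engagement_post(300, 90, 510),  # a suspiciously good day
]
print(zscore_anomalies(daily_post_engagement))  # -> [2]
```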
5. Coding Live - The Project Example Track
Finally, we had several individuals who were coding live. It was awesome to see people demonstrate their various skills and projects. You can look through the various live coding projects listed below.
One of those individuals was Ankit Virmani, who gave the talk Designing Streaming Data Pipelines: Best Practices and Considerations. The talk covered the considerations that go into designing a scalable and reliable streaming data platform and how it differs from batch data platforms. In particular, he focused on GCP, using components like Dataflow and Pub/Sub to build a streaming pipeline. You should go check it out.
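For a sense of what that can look like in code, here is a minimal sketch of a Pub/Sub-to-Dataflow style pipeline using the Apache Beam Python SDK (the SDK that Dataflow executes). The project and subscription names are placeholders, and the per-window count is a toy aggregation of my own, not the pipeline from the talk.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Streaming mode; add runner/project/region flags to actually submit to Dataflow.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Read raw messages from a (placeholder) Pub/Sub subscription.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "DecodeBytes" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # Bucket the unbounded stream into 60-second fixed windows.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # Toy aggregation: count events per window.
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "Log" >> beam.Map(print)
    )
```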
Other DEML Summit Talks
These were far from the only talks. We had several other live coding projects, as well as other talks and panels, and you can find links to all of them below. These videos won’t be listed publicly on my YouTube channel because that tends to do weird things to the algorithm, so you’ll need to find the videos linked below.
Data Engineering Talks
Data Science And ML Talks
The Fastest Way to ACTUALLY Get Hired as a Data Scientist with NO Work Experience
Test Driven Development for Your Data, Models and Pipelines Strategies and Methods
Data Driven Business Transformation Unleashing the Power of Data Science in Your Organization
Special Thanks!
I want to give a few very special thanks. First, to Xinran from Data Engineering Things, who co-hosted the event with me; it made the whole event go so much smoother! Also, thank you to all the volunteers! They did so much before, during, and after the event to make everyone’s experience better.
Next, as someone who has put on events before, I want to say thank you to Accel Events. They made putting on this conference very easy, and between their documentation and automation, much of what I would otherwise have had to do myself simply wasn’t an issue.
And of course, I want to say thank you to our sponsors Decube, Keboola, Segment, and Onehouse! Their support ensures we can make this conference even bigger and better.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
6 Steps to Avoid Messy Data in Your Warehouse
Data is a mess at most companies. Data teams face multiple issues, such as handling human-edited Excel files, duplicated metric definitions, incorrect source data, no data governance, lack of infrastructure, etc. If you:
Wonder if data at every company is still a mess?
Hope to find a better place to work, with better data development practices?
Believe analytic warehouses are a complete, unmitigated nightmare?
Job hop hoping to find a mythical “mature organization” where you’d learn how to do things right?
Believe that the data warehouse is shit everywhere?
Then this post is for you! Imagine if your data warehouse ran like a well-oiled machine, with correct data, and was super easy to use! Your work would be satisfying, and your career growth would be quick; that is what this post aims to help you achieve.
In this post, we will go over six critical steps to building (& maintaining) a data warehouse that gives stakeholders everything they may need while avoiding messy data.
Behind the Rust Hype: What Every Data Engineer Needs to Know
Rust, Rust, Rust. It’s truly amazing how this language feels like it has taken the Data Engineering community by storm.
It’s the cool new and shiny toy everyone is playing with. Can we just blame Polars for this newfound obsession? But is it just a toy? Is there a future for Rust in everyday Data Engineering? Is Python in danger?
That’s what we are going to look into today. Like one of the inquisitive hobbits peering into the Lost Seeing Stones, we will try to divine the future for the hordes of Python Data Engineers. Do they have anything to fear? Should they pay attention to Rust? What does the future hold?
Let’s dive in.
End Of Day 105
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.