There has been a lot of discussion and interest in the data engineering world around several specific libraries and languages. One of these libraries would have to be Polars.
So.
Why is Polars all the rage?
To answer this question, I pulled in Daniel Beach, who is both an experienced data engineer as well as the author behind the newsletter Data Engineering Central.
Daniel has spent well over a decade working in data and puts together amazing articles such as What's all the hype with DuckDB?
So I asked if he could help me answer the question, why all the interest in Polars all of a sudden?
So let’s dive in.
Remember when the new cool kid showed up at school? Enter Polars.
Every once in a while, a new tool breaks into the Data Engineering space that bubbles up from underneath the festering masses, like a groundswell, pushing up through the thick layers of muck and mire on social networks. It’s like a fresh spring of water in a parched and dry land; such is the new kid on the block, Polars, in these days of overdone marketing drivel.
The question is, why? What exactly is it about Polars that’s causing the average engineer to stop and take note?
There must be a reason for the popularity of Polars; otherwise, people simply wouldn’t talk about it, or use it for that matter. And that is the key, isn’t it? With every marketing department incessantly pouring out endless shallow reasons why their fancy UI is the next Airflow, the real question is: do people actually use it?
So, this is what lies ahead of us today. I will give you a general survey of Polars, the Python implementation.
Overview of Polars
Examples of using Polars
Performance
Polars - the 10,000-foot view.
So, let’s talk about Polars and cover the basics. What do you need to know about Polars if you’ve only heard about it and have no idea what all the noise is about? First, and probably the reason why it’s so popular, is that …
“Lightning-fast DataFrame library for Rust and Python”
DataFrames. Everyone loves DataFrames, especially all of us hungry data people always looking to gobble up larger and larger datasets; the DataFrame is our concept of choice. But what exactly is it about the Polars DataFrame API that makes it so popular?
It’s written in Rust, and therefore is way fast.
It provides a familiar DataFrame API with familiar functionality (similar to Spark, Pandas, etc.)
It doesn’t suffer from memory problems like Pandas.
Fully functional Python API.
The reason Polars is catching on is pretty straightforward: it provides a Python API built on top of superfast Rust, it avoids the memory problems common with Pandas, and it keeps the familiar DataFrame concept … it’s the logical replacement for those in the Pandas world who’ve been waiting for a new savior to come and take them away.
Of course, there are many more in-depth reasons for Polars being Polars, like the fact that it’s built on top of the Apache Arrow in-memory format, taking advantage of columnar storage and therefore making aggregates and analytics fast.
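If you want to see that Arrow foundation for yourself, the round trip between Polars and PyArrow is nearly free. Here is a minimal sketch (the tiny DataFrame is made up purely for illustration, and pyarrow needs to be installed):

import polars as pl

# A tiny, made-up DataFrame just for illustration.
df = pl.DataFrame({"id": [1, 2, 3], "number": [10, 20, 30]})

# Polars stores data in Arrow's columnar format under the hood,
# so converting to and from a pyarrow.Table is cheap.
arrow_table = df.to_arrow()
df_back = pl.from_arrow(arrow_table)

print(type(arrow_table))
print(df_back)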
But really, what IS it about Polars that has people excited?
Ok, ok, so it’s yet another DataFrame library. So what? Why are people excited? I think the answer has much to do with Pandas and Spark … strange, right? What does Polars have to do with Pandas and Spark? A lot, actually.
Data Engineering has been stuck in this weird twilight zone for a long time, caught between two worlds: Pandas and Spark. On one end it’s big data … Spark, Databricks, Snowflake, and BigQuery. On the far other end of the data spectrum, your only other option was Pandas. There was no middle ground.
But what about all the people stuck in between?! Data that is honestly too small for the overkill of Spark or BigQuery, but big enough to make Pandas a pain to use and untenable.
It was always simply a burden for Data Engineers. You had to pick Spark or something similar to deal with data that didn’t even need Spark … but why? Because Pandas simply couldn’t perform and wasn’t always approachable.
Or, you picked Pandas because you were trying to save money, but it was slow, burdensome, didn’t scale, and was frankly a pain to build pipelines with. But that was life because there were no other good options. Sure, some poor souls tried to use Dask or something else, but there is a reason it never caught on.
But now we have Polars
So that’s why I think Polars is catching on and will continue to. Polars is stepping into the breach, raising its shiny sword to the sky, and entering the fray. It can offer a similar feel to Spark when developing pipelines. It can handle larger datasets than Pandas, it’s much faster … and it doesn’t require the complexity and overhead of Spark.
It fits perfectly into that last piece of your giant puzzle.
What does it look like to use Polars (examples)?
I will try to make this short and sweet, but I want to give you a sense of what you’re in for if you decide to go the Polars route. If you’ve used PySpark before, it will seem familiar, like an old friend. If you’ve only been using Pandas, you are in for a treat.
If you’re interested in more details about replacing Pandas with Polars, check out this article Replacing Pandas with Polars. A Practical Guide. We are going to blast through this section; hold on, Sunny Jim.
You can read and write files in Polars. Duh.
import polars as pl
df = pl.read_csv('some_file.csv')
df = pl.read_parquet('some_file.parquet')
# large data, you can scan as well, without reading into memory.
df = pl.scan_csv('some_file.csv')
df = pl.scan_parquet('some_file.parquet')
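One thing the snippet above doesn’t show: scan_csv and scan_parquet return a LazyFrame, so nothing is actually read until you call .collect(). And since we said read and write, here’s the write side too. A quick sketch with made-up file and column names:

import polars as pl

# scan_* is lazy: the file isn't read until collect() is called.
lazy_df = pl.scan_parquet('some_file.parquet')

# 'number' is a made-up column name for illustration.
df = lazy_df.filter(pl.col("number") > 0).collect()

# Writing back out is just as simple.
df.write_csv('some_output.csv')
df.write_parquet('some_output.parquet')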
You can group by and aggregate
df = df.groupby(by='id').agg(pl.col("number").sum())
You can filter of course
df = df.filter(pl.col("action") == "yes")
Joins are a must
df_final = df_1.join(df_2, on="id", how="inner")
Dude, even window functions
result = df.select(
    [
        "Ding",
        "Dong",
        pl.col("number").count().over("Ding").alias("Ding_Count"),
        pl.col("number").count().over("Dong").alias("Dong_Count"),
    ]
)
The list goes on
We could be here all day. You get the point, at least if you are a PySpark or Pandas person.
Polars IS the easy version of Spark and the better version of Pandas. That’s the whole point. It doesn’t get any better. Frankly, you would be crazy to choose Pandas over Polars unless you have some strange use case. Also, if you work on data that can fit on a single node … save some money, throw Spark out the window, and get yourself some Polars!!
What about performance?
Everyone says it’s faster. Is it? Let’s use the Divvy Bike trip dataset, free to use. Let’s just simply read and filter a CSV file and see what happens. Nothing fancy here.
Polars code
import polars as pl
from datetime import datetime
t1 = datetime.now()
df = pl.read_csv('202301-divvy-tripdata.csv')
df = df.filter(pl.col('member_casual') == "member")
print(df.groupby(by='start_station_id').agg(pl.col('ride_id').count()).sort('start_station_id'))
t2 = datetime.now()
print(t2-t1)
Pandas code
import pandas as pd
from datetime import datetime
t1 = datetime.now()
df = pd.read_csv('202301-divvy-tripdata.csv')
df = df.loc[df['member_casual'] == 'member']
print(df.groupby('start_station_id').ride_id.agg(['count']).sort_values(by='start_station_id'))
t2 = datetime.now()
print(t2-t1)
So is Pandas faster? Who knows. Most others have gotten the opposite results when performance testing Pandas, Polars, and PySpark against each other. Performance is complex; it depends on how you write the code, the data size, and what you are actually trying to do.
Personally, I think Polars is a little more straightforward to write. Pandas has always been a little awkward.
Closing Thoughts
I hope, if anything, you got a little taste of why so many people are talking about Polars, even if you haven’t used it yourself. Next time you’re working with DataFrames, consider the switch. Give it a try.
Polars will be more intuitive, better on larger datasets, and even has a SQL Context for all you SQL die-hards out there. It’s a unique tool because it fits in the crack between Pandas and Spark. It can work on larger-than-memory datasets but doesn’t come with the complexity and overhead of Spark; it’s pretty much a Data Engineer’s dream come true.
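Since the SQL Context got a mention, here’s roughly what it looks like; the exact API has shifted a bit across Polars versions, so treat this as a sketch rather than gospel, and note that the table name and toy data are made up:

import polars as pl

# Toy data, made up for illustration.
df = pl.DataFrame({
    "member_casual": ["member", "casual", "member"],
    "ride_id": ["a", "b", "c"],
})

# Register the DataFrame under a table name and query it with plain SQL.
ctx = pl.SQLContext(trips=df)
result = ctx.execute(
    "SELECT member_casual, COUNT(*) AS rides FROM trips GROUP BY member_casual",
    eager=True,
)
print(result)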
Data Events You’re Not Going To Want To Miss!
Minneapolis Data Happy Hour - Tonight!
Data Reliability Engineering Conference 2023
New York City Data Happy Hour
Delivering Trusted Data with SMS
Join My Data Engineering And Data Science Discord
Recently my YouTube channel went from 1.8k to 55k and my email newsletter has grown from 2k to well over 36k.
Hopefully we can see even more growth this year. But, until then, I have finally put together a Discord server.
I want to see what people end up using this server for. How it is used will in turn play a role in what channels, categories, and support are created in the future.
Articles Worth Reading
There are 20,000 new articles posted on Medium daily, and that’s just Medium! I have spent a lot of time sifting through some of these articles, as well as TechCrunch and company tech blogs, and wanted to share some of my favorites!
Data Engineering Excellency at Netflix
It has been a little over two years since I joined Netflix as a data engineer, and I absolutely enjoyed every minute of my time here. Like I often told interview candidates: if you are passionate about data engineering, this team is where you want to be. Of course, life at Netflix isn’t always rainbow and sunshine. In this blog, I would like to share why (I believe) Netflix is a great place for data engineering, plus things that are not so great.
Measuring B2B Customer Satisfaction: How Dashlane Leverages a Unique, Data-Driven Approach
At Dashlane, we believe each of us should feel connected to the end user, regardless of our roles and level of daily customer interaction. To ensure this, one of Dashlane’s core objectives is customer satisfaction. Due to our French roots, we call this “Amour du Client,” meaning “love of the customer.”
There are many ways in which we live and breathe this objective, and some of us on the Product Analytics team wanted to share a recent example.
You can’t improve what you can’t measure. That’s why in 2022, the Product Analytics team wanted a better way to measure Amour du Client. We wanted a data-driven framework to help us monitor and understand how satisfied and engaged our customers are from the moment they first become aware of Dashlane to the moment they renew or upgrade…
End Of Day 73
Thanks for checking out our community. We put out 3-4 Newsletters a week discussing data, tech, and start-ups.
I love to see articles that allow people to learn about the existence of Polars and its advantages (so thank you for that), but I think the way you set up your benchmark is a bit disappointing. First, you point out that Polars lies between Pandas and Spark, yet the example file used is only ~150k rows after filtering - Pandas should always be plenty fast when dealing with a small amount of data. Second, Polars can utilize lazy evaluation to optimize your transformation, which isn't talked about or used in the example. Here's what the code would look like using your example:
import time
import polars as pl
s = time.time()
df = (pl
.scan_csv(r'Z:\202301-divvy-tripdata.csv')
.filter(pl.col('member_casual') == "member")
.groupby(by='start_station_id')
.agg(pl.col('ride_id').count())
.sort('start_station_id')
)
print(df.collect())
print(time.time() - s)
Last, there are some great benchmarks that can be seen here: https://h2oai.github.io/db-benchmark/
I love Polars over Pandas - on the dataset I was using, which was feature-rich, Polars joins felt superior both performance- and usability-wise.
I ran into a few small problems where I had to revert back to Pandas, but nothing a bit of hacky duct-tape code can't fix.