Why is Polars All The Rage?

Mar 1, 2023

And Can It Replace Pandas?

5 Comments

Mar 3, 2023

I love to see articles that help people discover Polars and its advantages (so thank you for that), but I think the way you set up your benchmark is a bit disappointing. First, you point out that Polars sits between Pandas and Spark, yet the example file is only ~150k rows after filtering, and Pandas should always be plenty fast on that little data. Second, Polars can use lazy evaluation to optimize your transformation, which the article neither mentions nor uses in its example. Here's what the code would look like using your example:

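    import time

    import polars as pl

    s = time.time()

    # scan_csv is lazy: it builds a query plan instead of reading the file right away
    df = (
        pl.scan_csv(r'Z:\202301-divvy-tripdata.csv')
        .filter(pl.col('member_casual') == "member")
        # newer Polars releases spell this method group_by
        .groupby('start_station_id')
        .agg(pl.col('ride_id').count())
        .sort('start_station_id')
    )

    # collect() runs the optimized plan and returns a DataFrame
    print(df.collect())
    print(time.time() - s)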

Lastly, there are some great benchmarks here: https://h2oai.github.io/db-benchmark/


0xEvan

Mar 2, 2023

I love Polars over Pandas. On the feature-rich dataset I was using, Polars joins felt superior in both performance and usability.

I ran into a few small problems where I had to fall back to Pandas, but nothing a bit of hacky duct-tape code can't fix.


David Sobey

Mar 2, 2023

>So is Pandas faster? Who knows. Most others have gotten the opposite results when performance testing Pandas, Polars, and PySpark against each other.

This is lazy writing. If you are going to write an article, please offer more insight than "who knows".
