Discussion about this post

MattB:

I love to see articles that introduce people to Polars and its advantages (so thank you for that), but I think the way you set up your benchmark is a bit disappointing. First, you point out that Polars sits between Pandas and Spark, yet the example file is only ~150k rows after filtering, and Pandas should always be plenty fast on that little data. Second, Polars can use lazy evaluation to optimize your transformation, which the post neither mentions nor uses. Here's what the code would look like for your example:

import time

import polars as pl

s = time.time()

# Build a lazy query: scan_csv returns a LazyFrame, so nothing is read from disk yet
df = (
    pl.scan_csv(r'Z:\202301-divvy-tripdata.csv')
    .filter(pl.col('member_casual') == 'member')  # predicate can be pushed into the scan
    .group_by('start_station_id')
    .agg(pl.col('ride_id').count())
    .sort('start_station_id')
)

# collect() runs the optimized plan and materializes the result
print(df.collect())
print(time.time() - s)
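
You can also inspect what the optimizer does before running anything. A minimal sketch, assuming a recent Polars version and the same hypothetical file path: explain() prints the optimized query plan, where the member_casual filter appears pushed down into the CSV scan rather than applied after loading.

import polars as pl

# Same lazy query as above; still nothing read from disk
query = (
    pl.scan_csv(r'Z:\202301-divvy-tripdata.csv')
    .filter(pl.col('member_casual') == 'member')
    .group_by('start_station_id')
    .agg(pl.col('ride_id').count())
)

# explain() returns the optimized plan as a string; the filter
# shows up inside the CSV SCAN node (predicate pushdown)
print(query.explain())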

Lastly, there are some great benchmarks here: https://h2oai.github.io/db-benchmark/

0xEvan:

I love Polars over Pandas - on the feature-rich dataset I was working with, Polars joins felt superior in both performance and usability.

I ran into a few small problems where I had to revert back to Pandas, but nothing a bit of hacky duct-tape code can't fix.
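
For anyone curious what the join API looks like, here is a minimal sketch with made-up frames (the column names and data are hypothetical, not from the post's dataset):

import polars as pl

# Hypothetical example frames
rides = pl.DataFrame({'station_id': [1, 2, 2], 'ride_id': ['a', 'b', 'c']})
stations = pl.DataFrame({'station_id': [1, 2], 'name': ['North', 'South']})

# join() takes the key column(s) via 'on'; 'how' also accepts
# e.g. 'left', 'semi', and 'anti'
joined = rides.join(stations, on='station_id', how='inner')
print(joined)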
