5 Comments

I love to see articles that introduce people to Polars and its advantages (so thank you for that), but I think the way you set up your benchmark is a bit disappointing. First, you point out that Polars sits between Pandas and Spark, yet the example file is only ~150k rows after filtering - Pandas should always be plenty fast on that little data. Second, Polars can use lazy evaluation to optimize your transformation, which isn't discussed or used in the example. Here's what the code would look like using your example:

import time

import polars as pl

s = time.time()

# scan_csv builds a lazy query instead of reading the file eagerly,
# so Polars can push the filter down and only materialize what's needed
df = (
    pl.scan_csv(r'Z:\202301-divvy-tripdata.csv')
    .filter(pl.col('member_casual') == "member")
    .groupby(by='start_station_id')   # renamed to group_by in newer Polars
    .agg(pl.col('ride_id').count())
    .sort('start_station_id')
)

# nothing runs until collect() executes the optimized plan
print(df.collect())
print(time.time() - s)

Lastly, there are some great benchmarks here: https://h2oai.github.io/db-benchmark/

I love Polars over Pandas - on the feature-rich dataset I was using, Polars joins felt superior in both performance and usability.

I ran into a few small problems where I had to revert to Pandas, but nothing a bit of hacky duct-tape code can't fix.
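For anyone curious what that looks like in practice, here's a minimal sketch (the tables below are made up purely for illustration): a Polars join, plus the to_pandas()/from_pandas() round-trip that usually serves as the duct tape when a Pandas-only feature is needed (the conversion requires pyarrow installed):

import polars as pl

# Two hypothetical tables, just to show the join API
rides = pl.DataFrame({
    'ride_id': [1, 2, 3],
    'start_station_id': ['a', 'b', 'a'],
})
stations = pl.DataFrame({
    'start_station_id': ['a', 'b'],
    'station_name': ['Clark St', 'State St'],
})

# Polars joins take an explicit on/how - no index bookkeeping as in Pandas
joined = rides.join(stations, on='start_station_id', how='left')
print(joined)

# Escape hatch for Pandas-only features
pdf = joined.to_pandas()    # Polars -> Pandas
back = pl.from_pandas(pdf)  # Pandas -> Polars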

>So is Pandas faster? Who knows. Most others have gotten the opposite results when performance testing Pandas, Polars, and PySpark against each other.

This is lazy writing. If you are going to write an article, please give more substantive insight than "who knows".

I'd be happy if you covered more tools in the space, to be honest. Posts that lay out the top products are the ones I get most excited to read from you.

Love it! I always thought Pandas was awkward too. It was a great v0 for data analysis in Python, but I think Polars is going to be very exciting.
