I love to see articles that allow people to learn about the existence of Polars and its advantages (so thank you for that), but I think the way you set up your benchmark is a bit disappointing. First, you point out that Polars lies between Pandas and Spark, yet the example file used is only ~150k rows after filtering - Pandas should always be plenty fast when dealing with a small amount of data. Second, Polars can utilize lazy evaluation to optimize your transformation, which isn't talked about or used in the example. Here's what the code would look like using your example:
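import time

import polars as pl

start = time.perf_counter()

# scan_csv builds a lazy query; nothing is read until collect() is called
query = (
    pl.scan_csv(r'Z:\202301-divvy-tripdata.csv')
    .filter(pl.col('member_casual') == 'member')
    # group_by is the current name; Polars releases before 0.19 call it groupby
    .group_by('start_station_id')
    .agg(pl.col('ride_id').count())
    .sort('start_station_id')
)

print(query.collect())
print(time.perf_counter() - start)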
Last, there are some great benchmarks that can be seen here: https://h2oai.github.io/db-benchmark/
I love polars over pandas - on the dataset I was using, which was feature-rich, polars joins felt superior both performance- and usability-wise.
I ran into a few small problems where I had to revert to pandas, but nothing a bit of hacky duct-tape code can't fix.
>So is Pandas faster? Who knows. Most others have gotten the opposite results when performance testing Pandas, Polars, and PySpark against each other.
This is lazy writing. If you are going to write an article, please give more insight than "who knows".
I'd be happy if you covered more tools in the space, to be honest. Learning what the top products are is what gets me most excited to read your posts.
Love it! I always thought Pandas was awkward too. It was a great v0 for data analysis in Python but I think Polars is going to be very exciting.