Spark, Dask, DuckDB, Polars: TPC-H Benchmark Results


Large-scale dataframe libraries are critical for efficient, user-friendly data manipulation. This space has grown rapidly in recent years, and there are many new choices.

In this talk we run the major contenders (Spark, Dask, DuckDB, Polars) through the TPC-H benchmarks, both locally and in the cloud, at scales ranging from 10 GB to 10 TB, and see how they perform. This will teach us about these specific libraries and about how to measure and reason about performance in the cloud.

We'll think through topics like I/O bandwidth, CPU saturation, and memory constraints, as well as challenges in deployment and hardware selection. We'll also bring in hardware and networking costs to get a sense of overall cost efficiency.
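As a rough illustration of what "cost efficiency" can mean here, the sketch below combines runtime, cluster size, and prices into a single dollar figure per benchmark run. It is only a minimal sketch under stated assumptions: the function name `cost_per_run`, its parameters, and the flat per-hour and per-GB pricing model are hypothetical and are not taken from the talk or from any cloud provider's actual pricing.

```python
def cost_per_run(runtime_seconds, n_workers, price_per_worker_hour,
                 transfer_gb=0.0, price_per_gb=0.0):
    """Estimate the dollar cost of one benchmark run.

    Assumes (hypothetically) that every worker is billed for the full
    runtime at a flat hourly rate, and that network transfer is billed
    at a flat per-GB rate. Real cloud billing is usually more complex.
    """
    # Compute cost: worker-hours consumed times the hourly price.
    compute_cost = (runtime_seconds / 3600) * n_workers * price_per_worker_hour
    # Network cost: data moved times the per-GB price.
    network_cost = transfer_gb * price_per_gb
    return compute_cost + network_cost


# Placeholder inputs for illustration only, not measured results.
example = cost_per_run(runtime_seconds=1800, n_workers=10,
                       price_per_worker_hour=0.5)
print(f"estimated cost: ${example:.2f}")
```

A figure like this makes it possible to compare a fast run on a large cluster against a slower run on a smaller one on equal terms, since a faster system is not necessarily the cheaper one.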

The presenters are biased towards Dask, so we'll use that project to dive a bit deeper into tuning and into which factors matter most for performance, but the overall results should be broadly interesting to anyone in the data infrastructure space.



Open Data Science
