A Bamboo of Pandas: Crossing Pandas’ Single-machine Barrier with Apache Spark


Pandas is a fast and powerful open-source data analysis and manipulation library for Python. Apache Spark is an open-source unified analytics engine for large-scale data processing. Both are widely adopted in the Data Engineering and Data Science communities.

Combining the two offers real gains in productivity, scalability and performance, yet the combination is often overlooked.

In this talk, we will show you how you can leverage recent developments in Apache Spark together with Pandas to enjoy the best of both worlds!

We will cover the following topics:
* Core concepts
* The motivation for combining Spark and Pandas
* Using Pandas code with Spark
* The new Pandas API in Spark
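
As a rough illustration of the last two topics, the sketch below shows one way existing Pandas code can be reused on Spark, and how the same logic might look with the Pandas API on Spark. It is not taken from the talk: the function names (`add_total`, `add_total_batches`, `run_on_spark`, `run_with_pandas_api`) and the `price`/`quantity` columns are hypothetical, and the Spark parts assume PySpark 3.2 or later is installed.

```python
import pandas as pd


def add_total(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain Pandas logic; runs on any Pandas DataFrame without Spark."""
    out = pdf.copy()
    out["total"] = out["price"] * out["quantity"]
    return out


def add_total_batches(batches):
    """Wrap the Pandas function for DataFrame.mapInPandas, which passes an
    iterator of Pandas DataFrames and expects an iterator back."""
    for pdf in batches:
        yield add_total(pdf)


def run_on_spark(pdf: pd.DataFrame) -> pd.DataFrame:
    """Option 1: run the existing Pandas code on Spark partitions."""
    # Imported lazily so the Pandas-only functions above work without PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pdf)
    result = sdf.mapInPandas(
        add_total_batches,
        schema="price double, quantity long, total double",
    )
    return result.toPandas()


def run_with_pandas_api(pdf: pd.DataFrame) -> pd.DataFrame:
    """Option 2: write Pandas-style code against the Pandas API on Spark."""
    import pyspark.pandas as ps

    psdf = ps.from_pandas(pdf)
    psdf["total"] = psdf["price"] * psdf["quantity"]
    return psdf.to_pandas()


if __name__ == "__main__":
    # Hypothetical sample data; only the Pandas path is exercised here.
    sample = pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})
    print(add_total(sample))
```

With `mapInPandas`, the Pandas function stays unchanged and Spark distributes the batches; with the Pandas API on Spark, the code keeps Pandas syntax while the computation is planned and executed by Spark.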


Throughout his career, Daniel Haviv has helped a multitude of companies solve their data challenges, most recently as a Senior Solutions Architect at Databricks and as an Analytics Specialist SA at AWS.

Open Data Science




One Broadway
Cambridge, MA 02142
