Abstract: Pandas is a fast and powerful open-source data analysis and manipulation library for Python. Apache Spark is an open-source unified analytics engine for large-scale data processing. Both are widely adopted in the data engineering and data science communities.
Combining them offers great value in terms of productivity, scalability, and performance, yet this is often overlooked.
In this talk, we will show how to leverage recent developments in Apache Spark together with Pandas to enjoy the best of both worlds!
We will cover the following topics:
* Core concepts
* The motivation for combining Spark and Pandas
* Using Pandas code with Spark
* The new Pandas API on Spark
Bio: Throughout his career, Daniel Haviv has helped a multitude of companies solve their data challenges, most recently as a Senior Solutions Architect at Databricks and as an Analytics Specialist SA at AWS.