Why Dataframe is not Always the Best Option for Distributed Computing

Abstract: 

Distributed computing engines such as Spark, Dask, and Ray allow data practitioners to scale their data processing over a cluster of machines. Data practitioners coming from a Pandas mindset often have a difficult time handling use cases that don’t fit DataFrame semantics. In this talk, we will

- Clarify that so-called unstructured data is not the deciding criterion. It is a common misunderstanding that unstructured data does not fit a DataFrame. We will use examples to illustrate which cases do and do not fit the DataFrame concept.
- Explain the real reasons a DataFrame may not be a good fit. We will show a few scenarios where using a DataFrame is either overkill or insufficient.
- Look at other distributed data structures such as Bag (RDD), Future, and Array. We will discuss their pros and cons and the scenarios where each is most appropriate.
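
As a rough illustration of the Future case (a sketch, not material from the talk), the snippet below uses Python's standard-library `concurrent.futures` as a local stand-in for a cluster scheduler. The names `simulate` and `run_all` are hypothetical. Each task is an independent function call with a small input and a small output, so there is no table to model and a DataFrame would add little.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(seed: int) -> int:
    """Hypothetical stand-in for an expensive, independent task."""
    # Toy computation: sum of squares of the integers below `seed`.
    return sum(i * i for i in range(seed))


def run_all(seeds):
    """Submit one task per seed and collect the results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each submit() call returns a Future immediately.
        futures = [pool.submit(simulate, s) for s in seeds]
        # result() blocks until the corresponding task has finished.
        return [f.result() for f in futures]


results = run_all([1, 2, 3, 4])
print(results)
```

Dask and Ray offer comparable submit-and-collect interfaces for running this kind of task on a cluster; the thread pool here only keeps the sketch self-contained.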

We will also discuss the mindsets and best practices that help us make the right choices:

- Separating orchestration and compute layers
- Reducing data footprint and transfer between nodes
- Tuning worker core utilization

Background Knowledge:

Attendees should have a good understanding of Python; no expert knowledge of Spark, Dask, or Ray is required.

Bio: 

Han Wang is the tech lead of Lyft Machine Learning Platform, focusing on distributed computing solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon, and Quantlab. Han is the creator of the Fugue project, which aims to democratize distributed computing and machine learning.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
