
Abstract: Distributed computing engines such as Spark, Dask, and Ray allow data practitioners to scale their data processing over a cluster of machines. Data practitioners coming from a Pandas mindset often have a difficult time handling use cases that don’t fit DataFrame semantics. In this talk, we will
- Clarify the so called unstructured data is not the criteria. It is a common misunderstanding unstructured data is not a fit for dataframe. We will use examples to illustrate different cases may or may not fit the dataframe concept.
- Explain the true reasons when dataframe is not a good fit. We will show a few scenarios where using dataframe is either an overkill or insufficient.
- Look at other distributed data structures such as Bag (RDD), Future, and Array. We will discuss their pros and cons and the most appropriate scenarios to use each of them.
We will also discuss the mindsets and best practices helping us make the right choices:
- Separating orchestration and compute layers
- Reducing data footprint and transfer between nodes
- Tuning worker cores utilization
Background Knowledge:
Attendees should have good understanding of Python, but no expert knowledge required for Spark, Dask and Ray
Bio: Han Wang is the tech lead of Lyft Machine Learning Platform, focusing on distributed computing solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon and Quantlab. Han is the creator of the Fugue project, aiming at democratizing distributed computing and machine learning.