Incremental Adoption of Spark, Dask, and Ray

Abstract: 

Using Spark, Dask, or Ray is not an all-or-nothing decision. Translating existing Pandas pipelines to these big data frameworks can seem daunting to new practitioners, but in reality, distributed computing can be adopted incrementally. In many use cases, only one or two steps of a pipeline require expensive computation. This talk covers strategies and best practices for moving portions of workloads to distributed computing through the open-source Fugue project. The Fugue API provides a suite of standalone functions compatible with Pandas, Spark, Dask, and Ray. Collectively, these functions allow users to scale any part of their pipeline when they are ready for full-scale production workloads on big data.

Session Outline:

Data practitioners often find Pandas becoming a bottleneck in their data workloads: either the data becomes too big for Pandas to handle effectively, or computationally expensive steps take hours to run. These scenarios call for distributed computing frameworks such as Spark, Dask, or Ray. The problem then becomes how to migrate existing code to frameworks with different syntax. In this talk, we approach the migration strategically, moving only the portions that truly benefit from additional resources. Examples include:

Training several machine learning models in parallel on small data
Expensive feature engineering on each group of data, feeding small machine learning models
Downsampling big data to smaller data with stratified sampling
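As a sketch of the last case, stratified downsampling is just per-group logic. The data and column names below are illustrative; locally it is a plain Pandas groupby-apply, and the same per-group function is the kind of step a framework like Fugue can later run distributed:

```python
import pandas as pd

# Illustrative data: two strata labeled "a" and "b"
df = pd.DataFrame({"label": ["a"] * 4 + ["b"] * 4, "x": range(8)})

def sample_group(g: pd.DataFrame) -> pd.DataFrame:
    # Keep at most 2 rows per stratum (fixed seed for reproducibility)
    return g.sample(n=min(len(g), 2), random_state=0)

# Locally this is an ordinary groupby-apply; distributed engines can run
# the same per-group function on each partition of big data
small = df.groupby("label", group_keys=False).apply(sample_group)
```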

In all of the above scenarios, only a single step requires a cluster and distributed computing. The Fugue open-source project has focused on this problem over the last two years, allowing data practitioners to port code to Spark with one function call: transform(). However, users still faced friction when they had to write Spark, Dask, or Ray code around the transform() step, so the learning curve for these engines remained.

As a result, Fugue released a new API with 60 standalone functions that are compatible across Pandas, Spark, Dask, and Ray. Collectively, they provide a minimal interface for distributing individual steps of a pipeline. Because the functions work with both Pandas and the big data frameworks, users retain the quick iteration speed of Pandas during development. When ready for production, the code can be executed on a cluster with a one-line code change.

The newly released Fugue API is intuitive and incrementally adoptable, removing the need to fully learn a distributed computing framework before moving workloads to a distributed setting. In this talk, we'll show real workloads that can be written in a few lines of code.

Bio: 

Han Wang is the tech lead of Lyft's Machine Learning Platform, focusing on distributed computing solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon, and Quantlab. Han is the creator of the Fugue project, which aims to democratize distributed computing and machine learning.

Open Data Science