Build Classification and Regression Models with Spark on AWS


Join us for an immersive session focused on optimizing PySpark and harnessing the power of machine learning using Spark MLlib. In this hands-on session, we will cover a wide range of topics, from understanding the project overview and core machine learning concepts to diving into the implementation of various classification algorithms. Through practical examples and demonstrations, you will gain a solid understanding of PySpark MLlib and its capabilities. Unsupervised learning techniques, such as K-Means clustering, will be explored alongside different types of classification algorithms, including decision tree and random forest classifiers.

Additionally, we will delve into essential data preprocessing techniques, such as changing column data types and handling missing values, ensuring your data is ready for analysis. You will also learn how to effectively split your data into training and testing datasets and validate your machine learning models using PySpark. Don't miss this opportunity to enhance your PySpark skills and unlock the full potential of Spark MLlib in your machine learning projects.

Session Outline:

Getting started with machine learning on big data
Why Spark on the cloud for machine learning
Different types of classification algorithms
Code walkthrough from data preparation to model deployment
Methods to validate machine learning models in PySpark

Learning objectives:

Introduction to PySpark MLlib
Understanding unsupervised learning
Different types of classification algorithms
Implementation of one of the algorithms (K-Means, random forest, etc.)
Data processing using PySpark
Model building with PySpark on AWS


Suman Debnath is a Principal Developer Advocate (Data Engineering) at Amazon Web Services, primarily focusing on data engineering, data analysis, and machine learning. He is passionate about large-scale distributed systems and is an avid fan of Python. His background is in storage performance and tool development, where he has developed various performance benchmarking and monitoring tools.

Open Data Science




One Broadway
Cambridge, MA 02142
