Introduction to Large-scale Analytics with PySpark


The amount of data being generated today is staggering, and it keeps growing. Over the last few years, Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. This workshop will introduce you to the fundamentals of PySpark, Spark's Python API, and to best practices in Spark programming. The world of distributed analytics and machine learning is vast and exciting, and this session is intended as a gateway to it.

Session Outline:

- Module 1: Basics of PySpark and the DataFrame API
Our goal will be to set up PySpark and get familiar with it, with a focus on the DataFrame API. We will also look at which use cases PySpark is a good fit for.

- Module 2: Techniques for working with real-world datasets
We will parse, preprocess and analyze a couple of large datasets. In addition to the DataFrame API, we will work with Spark SQL, one of PySpark's best features. To illustrate the versatility of the PySpark ecosystem, we will also work with textual data using the Spark NLP library.

- Module 3: Building an end-to-end data analytics pipeline
We will use the knowledge gained during the previous modules to analyze and model a real-world dataset. You will be introduced to and work with PySpark's machine learning API, SparkML.

Background Knowledge:

- Required: Python programming - syntax and basics of package installation
- Nice-to-have: Familiarity with Jupyter notebooks, data science techniques such as aggregation, and the fundamentals of machine learning


Akash Tandon is co-founder and CTO of Looppanel, where he builds software to help product teams record, store and analyze user research data. He is a co-author of Advanced Analytics with PySpark, published by O'Reilly. Previously, Akash worked as a senior data engineer at Atlan, SocialCops and RedCarpet, where he built data infrastructure for enterprise, government and finance use cases. He has also been a participant and mentor in the Google Summer of Code program with the R Project for Statistical Computing.

Open Data Science
One Broadway
Cambridge, MA 02142