
Abstract: The amount of data being generated today is staggering and still growing. Over the last few years, Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Text data is also becoming increasingly common as new techniques for working with it gain popularity.
This workshop will introduce you to the fundamentals of PySpark (Spark's Python API), the Spark NLP library, and best practices in Spark programming when working with textual or natural language data.
Session Outline:
- Module 1: Basics of PySpark and the DataFrame API
Our goal will be to set up PySpark and get familiar with its API. The focus will be on the DataFrame API and basic data operations such as filtering, grouping and aggregating. We will also look at which use cases PySpark is a good fit for.
- Module 2: PySpark for NLP
In this module, we'll discuss using PySpark for NLP tasks such as entity recognition and sentiment analysis. We'll cover how to load, preprocess, and analyze text data using PySpark. We'll also discuss when to use PySpark for NLP tasks and when to consider other Python NLP libraries.
We'll introduce Spark NLP, a popular NLP library built on top of Apache Spark and usable from PySpark. The hands-on exercise will demonstrate how to perform text preprocessing and feature extraction with Spark NLP.
- Module 3: Advanced NLP with Spark NLP
We'll discuss Spark NLP's capabilities, advantages, and integration with PySpark. We'll also demonstrate how to use Spark NLP for a task such as entity recognition or sentiment analysis.
Background Knowledge:
Required: Python programming - syntax and basics of package installation
Nice-to-have: Familiarity with Jupyter notebooks; basics of natural language processing and data science techniques such as aggregation.
Bio: Akash Tandon is co-founder and CTO of Looppanel, where he builds software to help product teams record, store and analyze user research data. He is a co-author of Advanced Analytics with PySpark, published by O'Reilly. Previously, Akash worked as a senior data engineer at Atlan, SocialCops and RedCarpet, where he built data infrastructure for enterprise, government and finance use cases. He has also been a participant and mentor in the Google Summer of Code program with the R Project for Statistical Computing.

Akash Tandon
Title: Co-Founder | Co-author, Advanced Analytics with PySpark | Looppanel | O'Reilly Media
