Distributed Training with PyTorch Lightning


This session covers the fundamentals of distributed training and how scaling works for ML model training. We then look at the core components of the PyTorch Lightning framework. Using these components, we demonstrate how to implement a simple model and scale it across different distributed strategies and accelerators without the usual engineering overhead.

An accelerator is the hardware that PyTorch Lightning uses for training and inference. Currently, PyTorch Lightning supports several accelerators: CPUs, GPUs, TPUs, IPUs, and HPUs. We will go over some of them in depth.

As an ML practitioner, you'd rather focus on research than on the engineering logic around hardware.

We'll show you how to easily scale your training for a large dataset across several accelerators such as GPUs and TPUs. We'll also go through the essential API internals of how PyTorch Lightning succeeds in abstracting the accelerator logic from users with support for distributed strategies, allowing them to focus on writing accelerator-agnostic code.
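As a rough illustration of what accelerator-agnostic code can look like, the sketch below keeps the hardware choice in a small table of keyword arguments intended for `pytorch_lightning.Trainer`. The table name `TRAINER_CONFIGS`, the helper `trainer_kwargs`, and the device counts are hypothetical and chosen only for this example; the `accelerator`, `devices`, `strategy`, and `max_epochs` arguments assume a reasonably recent PyTorch Lightning release.

```python
# Keyword arguments one might pass to pytorch_lightning.Trainer(**kwargs).
# The device counts are illustrative assumptions, not recommendations.
TRAINER_CONFIGS = {
    "cpu": {"accelerator": "cpu", "devices": 1},
    "single_gpu": {"accelerator": "gpu", "devices": 1},
    "multi_gpu_ddp": {"accelerator": "gpu", "devices": 4, "strategy": "ddp"},
    "tpu": {"accelerator": "tpu", "devices": 8},
}


def trainer_kwargs(target: str, max_epochs: int = 1) -> dict:
    """Return a fresh copy of the Trainer arguments for the given hardware target."""
    config = dict(TRAINER_CONFIGS[target])  # copy so the table is never mutated
    config["max_epochs"] = max_epochs
    return config
```

With PyTorch Lightning installed and the matching hardware available, something like `pl.Trainer(**trainer_kwargs("multi_gpu_ddp"))` would be the only line that changes between targets; the model and data code stay the same.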

Outline of the Session

Part 1: Fundamentals of Distributed Training
We will discuss the core principles of distributed training in machine learning. We'll also talk about why we need it and why it's so complicated. Then, we'll go over two fundamental approaches to distributed training in depth: Data and Model Parallelism.
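To make the idea of data parallelism concrete, here is a minimal pure-Python sketch, not taken from the session itself. It assumes a one-parameter linear model with a mean-squared-error loss and a dataset that splits evenly across workers; all function names are hypothetical. Each simulated worker computes a gradient on its own shard, and averaging those gradients (the role an all-reduce plays in practice) reproduces the full-batch gradient.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def shard(items, num_workers):
    """Split items into num_workers contiguous shards of equal size.

    Assumes len(items) is divisible by num_workers.
    """
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, xs, ys, num_workers):
    """Simulate data parallelism: same weight on every worker, different data."""
    local_grads = [
        mse_gradient(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, num_workers), shard(ys, num_workers))
    ]
    # Stand-in for the all-reduce step: average the per-worker gradients.
    return sum(local_grads) / num_workers
```

Because the shards are equally sized, the averaged result matches `mse_gradient` on the whole dataset, which is why every replica can apply the same update and stay in sync.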

Part 2: An Introduction to the PyTorch Lightning Core Components
This section will go through PyTorch Lightning's core building blocks and how they fit into a typical researcher's or data scientist's development pipeline. We'll show you how to organize your PyTorch research code in a LightningModule and go through the feature-rich PyTorch Lightning Trainer to help you supercharge your ML pipeline.

Part 3: Using PyTorch Lightning at Scale

Learning objectives:

Who is it aimed at?
Data scientists and ML engineers who want to use distributed training for their models, whether or not they have used PyTorch Lightning before.
What will the audience learn by attending the session?
How to get started with PyTorch Lightning.
The basics of distributed training and several ML accelerators.
How to train a model with PyTorch Lightning using different accelerators and strategies.

Background Knowledge:
Some familiarity with Python, deep learning terminology, and the basics of neural networks.


Kaushik Bokka is a Senior Research Engineer at Grid.ai and one of the core maintainers of the PyTorch Lightning library. He has prior experience building production-scale machine learning and computer vision systems for several products, ranging from video analytics to fashion AI workflows. He has also contributed to a few other open-source projects and aims to empower the way people and organizations build AI applications.

Open Data Science




One Broadway
Cambridge, MA 02142
