
Abstract: Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, deep learning, one of the most intriguing and performant families of algorithms, remains difficult for many groups to deploy in production, both because it demands tremendous compute resources and because it is inherently difficult to tune and configure. In this talk, we'll show how to deploy the Microsoft Cognitive Toolkit (CNTK) inside Spark clusters on the Azure cloud platform. We'll discuss the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and applying techniques for distributed hyperparameter optimization. We'll illustrate with a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing. Moreover, we'll discuss some recent advances in the R APIs for Spark and CNTK, and show how a data scientist familiar with R can take their existing workloads and deploy them in distributed Spark clusters without knowing much about Spark at all! All examples will be available for you to try out on your own in Azure's HDInsight Spark environment.
Bio: Ali is a data scientist in the Algorithms and Data Science team at Microsoft. He focuses on making distributed computing in the cloud easier, more efficient, and more enjoyable for data scientists and developers alike. He works primarily on statistical computing with R and Spark, and scalable implementations of Bayesian learning algorithms. Ali studied stochastic analysis and statistical machine learning at the University of Toronto and Stanford University, with a focus on Bayesian learning and distributed implementations of Markov chain Monte Carlo algorithms.

Ali Zaidi
Title: Data Scientist at Microsoft
Category: east2017
