Abstract: Deep Learning (DL) is ubiquitous. Yet leveraging distributed memory systems for DL algorithms is incredibly hard. In this talk, we will present approaches to bridge this critical gap. We will start by scaling DL algorithms on large-scale systems such as supercomputers and cloud computing systems.
Specifically, we will:
1) present our TensorFlow and Keras runtime extensions, which require negligible changes to user code for scaling DL implementations,
2) present communication-reducing/avoiding techniques for scaling DL implementations,
3) present approaches for leveraging memory architectures for DL implementations, and
4) present research on semi-automatic generation of DNN topologies.
Our results will include validation at several US supercomputing sites, such as Berkeley's NERSC, the Oak Ridge Leadership Computing Facility, and PNNL Institutional Computing, as well as results on AWS. We will provide pointers to, and discuss the general availability of, our research under the umbrella of the Machine Learning Toolkit for Extreme Scale (MaTEx), available at http://github.com/matex-org/matex.
Bio: Abhinav Vishnu is a chief scientist and team lead for scalable machine learning at Pacific Northwest National Laboratory. He focuses on designing extreme-scale Deep Learning algorithms that can execute on supercomputers and cloud computing systems. His specific objectives are to design user-transparent distributed TensorFlow; novel communication-reducing/approximation techniques for DL algorithms; fault-tolerant Deep Learning/Machine Learning algorithms; multi-dimensional deep neural networks; and applications of these techniques to several domains, such as high energy physics, computational chemistry, and general computer vision tasks. His research is publicly available as the Machine Learning Toolkit for Extreme Scale (MaTEx) at http://github.com/matex-org/matex.