Abstract: Linear algebra problems form the heart of many machine learning computations, but the demands on linear algebra from scientific machine learning problems can be different than for internet and social media applications. In particular, the need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more. We will describe these evaluations. These consist of exploring the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. Many of these algorithms use randomization in novel ways, and we will describe some of the underlying randomized linear algebra techniques. Finally, we'll describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets. We describe use cases from scientific data analysis that motivated the development of Alchemist and that benefit from this system. We'll also describe related work on communication-avoiding machine learning, optimization-based methods that can call these algorithms, and extending Alchemist to provide an ipython notebook <=> MPI interface.
Bio: Michael Mahoney is at the University of California at Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received him PhD from Yale University with a dissertation in computational statistical mechanics, and he has worked and taught at Yale University in the mathematics department, at Yahoo Research, and at Stanford University in the mathematics department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), he was on the National Research Council's Committee on the Analysis of Massive Data, he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets, and he spent fall 2013 at UC Berkeley co-organizing the Simons Foundation's program on the Theoretical Foundations of Big Data Analysis.
Michael Mahnoney, PhD
Statistics Professor | UC Berkeley