Abstract: This workshop will provide an overview of the process of building predictions from raw data using the Apache Mahout machine-learning library and the Apache Spark compute framework. We will present recent developments in Mahout including the matrix-math-oriented domain-specific language named “Samsara” as well as a new algorithm development framework. Then we will walk through the steps required to convert a data set to the format required for analytic methods, split the data, train and test a model, and then output predictions.
This approach takes advantage of a declarative Scala-based DSL that allows practitioners to operate on very large data sets while focusing on the math in their work symbolically rather than needing to acquire significant programming experience and skill to manipulate matrices and other structures. Attendees will be given example source code and pointers on getting started on their own projects.
Bio: Andrew Musselman is a member of the Apache Mahout Project Management Committee, an independent data engineering and analytics consultant, and co-host of the Adversarial Learning podcast. He loves distributed matrix math and lives in Seattle with his wife and kids.