Abstract: Reproducible research documents not just the findings of a study but also the exact code required to produce those findings. Study authors need reproducibility to reliably repeat their own analysis and to accelerate new findings by applying the same techniques to new data. The increased transparency allows peers to quickly understand a study's methods and compare them with those of other studies, and it can lead to higher levels of trust and interest and, eventually, more citations of the work. Big data introduces new challenges for reproducible research. As our data universe expands and the open data movement grows, more data than ever is available to analyze, and the possible combinations are practically limitless. Data cleaning and feature extraction often involve lengthy sequences of transformations. The space allotted for publications is not adequate to describe all of these details well enough for others to review and reproduce them. Fortunately, the open source community is addressing this need with Apache Spark, Zeppelin and Hadoop. Apache Spark 2.0 makes it even simpler and faster to harness the power of a Hadoop computing cluster to clean, analyze and explore large data sets and to train machine learning models on them. Zeppelin web-based notebooks let you capture code and interactive visualizations and share them with others. After this session you will be able to create a reproducible data science pipeline over large data sets using Spark, Zeppelin, and a Hadoop distributed computing cluster. Learn how to combine Spark with other supported interpreters to codify your work from cleaning to exploration to feature extraction and machine learning. Discover how to share your notebooks and data with others using the cloud. This talk will cover Spark and show examples, but it is not intended to be a complete tutorial on Spark.
Bio: Carolyn Duby is a Solutions Engineer at Hortonworks, where she helps customers harness the power of their data with Apache open source platforms. Prior to joining Hortonworks, she was the architect for cyber security event correlation at SecureWorks. Ms. Duby earned an ScB magna cum laude and an ScM in Computer Science from Brown University. She recently completed the Johns Hopkins University Data Science Specialization on Coursera. With diverse experience working for small companies, startups, large companies, and herself, she has a passion for challenging, data-intensive systems. For fun, she enjoys cooking, singing, horseback riding, and fitness.