Abstract: Despite the impressive applications of statistics, machine learning, and visualization in industry, many attempts at "data science" are anything but scientific. Specifically, data science processes often lack reproducibility, a key tenet of science in general and a prerequisite for true collaboration in a scientific community. In this session, I will discuss the importance of reproducibility and data provenance in any data science organization, and I will provide practical steps to help such organizations produce reproducible data analyses and maintain integrity in their data science applications. I will also demo a reproducible data science workflow that includes complete provenance explaining the entire process that produced specific results.
Bio: Daniel (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world, teaches data science/engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and is actively helping to organize contributions to various open-source data science projects.