
Abstract: This workshop walks attendees through the steps of an end-to-end machine learning campaign: data retrieval; data curation; model construction, evaluation, selection, and interpretation; and reporting. Particular attention will be paid to reporting, i.e., building a narrative. Examples will demonstrate how to generate multiple output formats (e.g., HTML pages, presentation slides, PDF documents) from the same code base.
The practice of literate computing underpins this workshop. Literate computing weaves a narrative directly into an interactive computation. Text, code, and results are combined into a narrative that relies equally on textual explanations and computational components. Insights are extracted from data using computational tools and communicated in the form of a narrative that resonates with its audience. Literate computing lends itself to the practice of reproducible research: one may re-run the analyses, run them with new data sets, or modify the code for other purposes.
The workshop will have three main foci:
1. infrastructure: instantiating the computational environment; loading packages; loading data
2. computation: data curation, transformation, and analysis; model construction, evaluation, and interpretation
3. communication: creating tables, charts, and graphs; weaving all components into a data narrative
As a specific example, a data narrative will be built showing how one might build predictive models for the toxicity of organic molecules. Reports will be presented as (1) an HTML file, (2) a PDF document (in a format acceptable for journal submission), and (3) a slide presentation.
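As a rough illustration of the three foci, and of producing more than one report format from a single code base, the following Python sketch may help. It is not the workshop's material: the data are synthetic (an invented descriptor and response, not real toxicity measurements), the one-descriptor least-squares model merely stands in for whatever models the workshop builds, and every function name is hypothetical.

```python
import html
import math
import random


# --- infrastructure: obtain data (synthetic stand-in, NOT real toxicity data) ---
def make_synthetic_data(n=40, seed=0):
    """Return invented (descriptor, response) values for illustration only."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 5.0) for _ in range(n)]
    ys = [1.5 * x + 0.5 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


# --- computation: fit and evaluate a simple model ---
def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the given points."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


def run_analysis():
    """Run the whole analysis once; every report format reuses this result."""
    xs, ys = make_synthetic_data()
    train_x, test_x = xs[:30], xs[30:]
    train_y, test_y = ys[:30], ys[30:]
    slope, intercept = fit_line(train_x, train_y)
    return [
        ("slope", slope),
        ("intercept", intercept),
        ("train RMSE", rmse(train_x, train_y, slope, intercept)),
        ("test RMSE", rmse(test_x, test_y, slope, intercept)),
    ]


# --- communication: render the same results in two formats ---
def to_markdown(rows):
    """Render the results as a Markdown table."""
    lines = ["| quantity | value |", "| --- | --- |"]
    lines += [f"| {name} | {value:.3f} |" for name, value in rows]
    return "\n".join(lines)


def to_html(rows):
    """Render the results as an HTML table."""
    body = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{value:.3f}</td></tr>"
        for name, value in rows
    )
    return f"<table><tr><th>quantity</th><th>value</th></tr>{body}</table>"


results = run_analysis()
print(to_markdown(results))
print(to_html(results))
```

Because both renderers consume the same `results` list, rerunning the analysis on new data updates every output format at once, which is the property the notebook-based tooling discussed here provides at larger scale.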
While the workshop’s example comes from the field of cheminformatics, the computational tools used and the exercises presented are applicable to any field where an investigator is interested in building predictive models and describing these models to colleagues and associates.
The R and Python ecosystems will be used throughout. All data, code, and text will be made available.
At the workshop’s conclusion, attendees will have worked through exercises that may serve as templates for building data narratives with their own data.
● Jupyter notebooks
● RStudio
https://github.com/pjkowalczyk/StoredDataStories
https://drive.google.com/open?id=1egltdovhlUqFzhp_QVOdc5yKp0fWN3_D
Bio: Paul Kowalczyk is a Senior Data Scientist at Solvay. There, Paul uses a variety of toolchains and machine learning workflows to visualize, analyze, mine, and report data, and to generate actionable insights from those data. Paul is particularly interested in democratizing data science, working to put data products into the hands of his colleagues. His experience includes using computational chemistry, cheminformatics, and data science in the biopharmaceutical and agrochemical industries. Paul received his PhD from Rensselaer Polytechnic Institute and was a Postdoctoral Research Fellow with IBM’s Data Systems Division.

Paul J Kowalczyk, PhD
Senior Data Scientist | Solvay
