
Abstract: Machine learning is usually taught through tutorials that load small, clean datasets into data frames and orchestrate everything in Jupyter notebooks, all within a single in-memory, local environment. This style works well for introducing a new topic and teaching the main ideas, but these patterns do not carry over to delivering real production applications at scale. Real industrial projects involve multiple environments and draw their data from databases or other data stores rather than files. They interact with live production systems and must be coordinated with software delivery teams and product owners. They must be production quality: well designed, well tested, and maintainable. As a result, data scientists often have to choose between the environment they are used to and one that is suitable for delivery to production, with an awkward migration from one to the other. In this workshop, we show how to maintain data science productivity while collaborating effectively and delivering value continuously and seamlessly. We demonstrate CI/CD practices for machine learning and guide participants through them, along with a new pattern of working that avoids most of the pitfalls of the typical approach.
Participants will learn how to use new patterns of repeatable, continuous model development in industrial data science projects, applying Continuous Integration (CI) and Continuous Delivery (CD) practices to collaborate effectively and deliver value continuously.
● GitHub;
● Docker;
● Jenkins;
● Jupyter;
● Python;
● DVC;
● MLflow;
● Kibana;
● Elasticsearch.
https://github.com/thoughtworksInc/CD4ML-Scenarios
https://drive.google.com/open?id=1QtJljTqRqR5E-GfgpGPZTaWyroXNcXFZ
Bio: Eric Nagler is a Lead Data Engineer at ThoughtWorks, Inc., with eight years of experience developing innovative batch and real-time data solutions for clients across multiple domains. Eric holds a Master's degree in Computer Science with a focus on Parallel and Distributed Algorithms. His data interests include Big Data Application Architecture and Design, Parallel Computation, Big Data ETL, NoSQL Data Storage, Natural Language Processing, Geospatial Analytics, Web Services, and Data Visualization.

Eric Nagler
Lead Data Engineer | ThoughtWorks, Inc.
