Data Science in the Industry: Continuous Delivery for Machine Learning

Abstract: 

Machine learning is usually taught through tutorials that load small, clean datasets into data frames and orchestrate the work in Jupyter notebooks, all within a single in-memory, local environment. This style works well for presenting a new topic and teaching the main ideas, but these patterns are not conducive to delivering real production applications at scale. Real industrial projects involve multiple environments, and their data comes from databases or other data stores rather than from files. They interact with live production systems and must be coordinated with software delivery teams and product owners. They must be production quality: well designed, well tested, and maintainable. As a result, data scientists often have to choose between the environment they are used to and one that is suitable for delivery to production, and then face an awkward migration from one to the other. In this workshop, we show how to maintain data science productivity while collaborating effectively and delivering value continuously. We guide participants through CI/CD practices for machine learning and a pattern of working that avoids most of the common pitfalls.

Session Outline:
The training materials and instructions are available at https://github.com/ThoughtWorksInc/CD4ML-Scenarios

Introduction
We begin with an introduction to MLOps and how we approach it through what we call Continuous Delivery for Machine Learning. We then explain how we chose this problem and put the workshop together as an example of how to apply MLOps principles.

Part 1 - System setup
The workshop uses several tools running in Docker. It also requires working in a forked GitHub repository and using a personal access token. We'll walk you through these steps, and by the end of this part your Docker and GitHub setup will be complete.

Part 2 - Jenkins setup
Here you'll set up and configure a deployment pipeline to build the application and deploy it to production.

Part 3 - Machine Learning on the Zillow Housing Problem
Learn to run experiments without disrupting the rest of your team or changing the production model. Learn how the codebase is designed and how that design enables flexibility while maintaining reproducibility.
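
One way to picture experimenting without touching the production model is to keep the baseline parameters fixed and describe each experiment as a named set of overrides. The sketch below is only an illustration under that assumption; the names DEFAULT_PARAMS, EXPERIMENTS, and resolve_params, as well as the parameter values, are hypothetical and are not taken from the workshop repository.

```python
from copy import deepcopy

# Hypothetical baseline parameters standing in for the production model's settings.
DEFAULT_PARAMS = {"model": "random_forest", "n_estimators": 10, "max_features": 0.5}

# Hypothetical named experiments; each one lists only the values it overrides.
EXPERIMENTS = {
    "default": {},
    "more_trees": {"n_estimators": 100},
}


def resolve_params(experiment_name: str) -> dict:
    """Return the full parameter set for an experiment without mutating the defaults."""
    if experiment_name not in EXPERIMENTS:
        raise KeyError(f"Unknown experiment: {experiment_name}")
    params = deepcopy(DEFAULT_PARAMS)           # start from the baseline
    params.update(EXPERIMENTS[experiment_name])  # apply only this experiment's overrides
    return params
```

Because every run is identified by a name and resolves to a complete, explicit parameter set, an experiment can be repeated later, and the baseline stays unchanged until the team deliberately promotes a new one.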

Part 4 - Continuous Delivery
Learn about the principles of Continuous Deployment. We demonstrate a CD quality check that ensures our changes don't adversely affect production applications.
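
A quality check of this kind can be sketched as a gate that scores a candidate model on held-out data and blocks deployment when the score falls below a threshold. The example below is a minimal sketch, not the workshop's implementation: the choice of R squared as the metric, the function names, and the threshold value are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    if ss_tot == 0:
        raise ValueError("R squared is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


def check_quality_gate(
    actual: Sequence[float], predicted: Sequence[float], min_r2: float
) -> Tuple[bool, float]:
    """Return (passed, score); the gate passes when the score meets the threshold."""
    score = r_squared(actual, predicted)
    return score >= min_r2, score


def gate_exit_code(passed: bool) -> int:
    """Map the gate result to a process exit code a pipeline stage could return."""
    return 0 if passed else 1
```

In a deployment pipeline, a stage would typically run such a check after training and stop the build on a non-zero exit code, so a model that regresses never reaches production.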

Part 5 - Model Monitoring and Observability
Configure and deploy our application to log prediction events to Elasticsearch. Visualize the events on a Kibana dashboard. Learn how to close the data feedback loop.
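
To make the idea of a prediction event concrete, the sketch below builds a JSON document that records what the model saw and what it predicted. The field names, the build_prediction_event and to_json_line helpers, and the sample feature are hypothetical; the workshop's actual event schema may differ, and sending the document to Elasticsearch is deliberately left out.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_prediction_event(
    features: dict,
    prediction: float,
    model_version: str,
    now: Optional[datetime] = None,
) -> dict:
    """Assemble one prediction event as a plain dict ready for JSON serialization."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "@timestamp": timestamp,          # when the prediction was made (UTC by default)
        "model_version": model_version,   # which model produced it
        "features": dict(features),       # copy of the inputs the model received
        "prediction": float(prediction),  # the model's output
    }


def to_json_line(event: dict) -> str:
    """Serialize an event to a single JSON line with stable key order."""
    return json.dumps(event, sort_keys=True)
```

Recording the inputs alongside each prediction is what makes the feedback loop possible: once real outcomes become available, they can be matched against the logged events to measure how the model performs in production.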

Background Knowledge:
Docker must be installed locally, and it must be possible to allocate 4GB of RAM to it (the workshop shows how to do this). A GitHub account is required, and you must be able to fork the repository https://github.com/ThoughtWorksInc/CD4ML-Scenarios

Bio: 

Andy Symonds is a technologist passionate about using data science in new and interesting ways. With a background in academia before moving into consulting, he loves problem solving and experimentation. Andy works with clients to help them gain insights and drive business value by developing proofs of concept and moving these solutions into production.

Open Data Science
