Reproducible and Shareable Notebooks Across a Data Science Team

Abstract: 

At CybelAngel we scan the internet looking for sensitive data leaks belonging to our clients.
Because the volume of alerts can reach billions of samples, we use machine learning to discard as much noise as possible and reduce the analysts' workload.

We are a growing team of data scientists and a machine learning engineer, planning to double in size. Each of us contributes to projects, and we work in notebooks before the code is industrialised. As in many other data science teams, a lot of effort and valuable work is encapsulated in a format that is tricky to share, hard to reproduce and simply not built for production. During the talk, we will present what we did to overcome some of these issues and share our feedback on notebook versioning and on our implementation on Google Cloud Platform using JupyterHub and Jupytext.

This talk is aimed at a technical audience, but anyone who works alongside a data team is welcome and will gain insight into the challenges of how data science interacts with the rest of the organisation.

We will talk about Jupytext, an open-source Python library that makes notebooks easy to put under version control, and about how we switched from Google Cloud Platform managed JupyterLab notebooks to JupyterHub running on Kubernetes to be able to actually use it. We will discuss this from both a data science and a machine learning operations point of view, to highlight how the two disciplines go hand in hand and complement each other.

From a technical point of view, we will show how Jupytext works under the hood and how we integrated it into our technical environment: JupyterHub, Google Cloud Platform and GitLab. We will highlight the pros and cons of our solution and discuss possible alternatives. Finally, to give a complete picture, we will also cover the challenges of introducing this change within the team.
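To give a feel for the idea behind text-based notebook versioning, here is a minimal sketch, using only the Python standard library, of how a notebook's JSON can be flattened into a plain script with `# %%` cell markers, in the style of Jupytext's "percent" format. The function name `notebook_to_percent_script` and the example notebook are hypothetical and written for illustration only. This is not Jupytext's implementation and not the setup presented in the talk; it ignores raw cells, cell metadata and the header that Jupytext writes.

```python
import json


def notebook_to_percent_script(notebook_json: str) -> str:
    """Flatten an .ipynb JSON string into a percent-style script.

    Simplified illustration (hypothetical helper, not Jupytext's code):
    keeps only markdown and code cells, and drops outputs and metadata.
    """
    nb = json.loads(notebook_json)
    chunks = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The ipynb format stores a cell's source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        if cell.get("cell_type") == "markdown":
            # Markdown cells become comment lines under a [markdown] marker.
            body = "\n".join(("# " + line).rstrip() for line in source.splitlines())
            chunks.append("# %% [markdown]\n" + body)
        elif cell.get("cell_type") == "code":
            # Code cells are kept verbatim; their outputs are never copied.
            chunks.append("# %%\n" + source)
    return "\n\n".join(chunks) + "\n"


# Hypothetical two-cell notebook, including an output that should not be versioned.
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Alert triage"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["42\n"]}],
            "source": ["print(6 * 7)"],
        },
    ],
})

script = notebook_to_percent_script(example_notebook)
print(script)
```

Because the resulting script contains only cell sources, a diff in a merge request shows the actual code and text changes rather than JSON noise, execution counts or embedded outputs. With Jupytext itself, pairing a notebook with such a script is typically done with `jupytext --set-formats ipynb,py:percent notebook.ipynb`, where `notebook.ipynb` stands for any notebook file.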

Background Knowledge:
Attendees should be familiar with data science tools, version control tools and the challenges of industrialising machine learning models.

Bio: 

Thomas Galtier, PhD, is the Data Science and Analytics Director at CybelAngel and holds a PhD in Applied Mathematics. He is passionate about data, especially “telling stories with the right data”.
As a former researcher, he believes that reproducibility is a key concept in machine learning workflows and that sharing will allow us to tackle tomorrow's challenges in the AI field.

Open Data Science