At CybelAngel, we are a growing team of data scientists and a machine learning engineer, planning to double in size. Each of us contributes to projects, and we rely on shareable notebooks before code is industrialized for production.

Notebooks: let’s focus on them. Here we talk specifically about Jupyter Notebooks, but alternatives exist. They are one of the most popular tools data scientists use to develop and present their work. For those who have never seen one, a Notebook is an interactive coding environment that runs in the browser on top of a “kernel,” a process specific to the programming language. Technically, it is a REPL organized into “cells”: you run code in a cell and the results appear directly below it, which lets you inspect a variable’s value or a plot, for example.
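To make that concrete, here is a simplified sketch (our own illustration, not the complete nbformat specification) of how a code cell and its result are stored together in a notebook’s underlying JSON:

```python
import json

# Simplified sketch of one code cell as stored inside a .ipynb file
# (Jupyter notebooks are JSON documents; real cells carry more metadata).
cell = {
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {},
    "source": ["x = 2 + 2\n", "x"],
    "outputs": [
        {
            "output_type": "execute_result",
            "execution_count": 1,
            "data": {"text/plain": ["4"]},
            "metadata": {},
        }
    ],
}

# Code and its output are serialized side by side, which is what lets a
# notebook display results directly below each cell.
print(json.dumps(cell, indent=2))
```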

If the Notebook runs in a shared environment like the dedicated service on Google Cloud Platform (GCP), multiple users can access it simultaneously and contribute to it. This is an essential feature when working within a team, but it has drawbacks: you can’t follow who did what, when, or how a notebook evolved unless it is versioned. Depending on how the shared environment is implemented, contributing simultaneously can be tricky if users are indistinguishable, as happens on GCP Vertex AI Workbench, for example.

Moreover, data science has peculiarities compared to software development: there is much more uncertainty about the outcome of a use case, since it depends primarily on the data and also on the performance of a machine learning model. Typically, a data science use case starts with data exploration, which can be a long process before model development even begins. Notebooks are the ideal tool for data exploration and model development because both are iterative processes. This dynamic, uncertain (the usefulness of the outcome is unpredictable, not the process) and iterative, fits Notebooks particularly well, which is why they are hard to replace in the life cycle of a data science use case.

Once the use case is “finished,” meaning it has proved its business value, it is time to go to production. Production requires reproducible and tested code, so standard software development techniques and ways of working apply: you probably have CI/CD and a versioning platform in place. Notebooks don’t fit well into such a context.
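One concrete reason notebooks clash with versioning: merely re-running a cell changes its stored metadata. The sketch below (standard library only, our own illustration) diffs two saves of an otherwise identical cell:

```python
import difflib
import json

# Two saves of the *same* code cell: the code is untouched, only the
# execution count changed after a re-run -- yet the JSON diff is already
# non-empty. This is why raw .ipynb files are noisy to review in git.
before = {"cell_type": "code", "execution_count": 1,
          "source": ["x = 2 + 2"], "outputs": [], "metadata": {}}
after = {**before, "execution_count": 7}

diff = list(difflib.unified_diff(
    json.dumps(before, indent=1).splitlines(),
    json.dumps(after, indent=1).splitlines(),
    lineterm="",
))
print("\n".join(diff))
```

Multiply this by every cell that also embeds its outputs (plots, dataframes, base64 images), and a one-line code change can produce a diff of hundreds of lines.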

Long story short, as for many data science teams, a lot of effort and valuable work ends up encapsulated in a format that is tricky to share, hardly reproducible, and simply not built for production. On the other hand, you don’t want data scientists to stop using Notebooks, because they bring tangible added value, but you still need to go beyond them.

In our talk at ODSC Europe 2022, titled “Reproducible and shareable notebooks across a data science team,” we will present what we did to overcome some of these issues and share our experience implementing it on Google Cloud Platform with JupyterHub and Jupytext. We will also show how we combine notebooks for development with scripts in the same project that runs in the production environment.
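As a taste of what Jupytext pairing looks like, here is its “percent” script format: each notebook cell becomes a `# %%` block in a plain .py file that diffs cleanly under version control. The `split_cells` helper below is our own illustration of the format, not part of Jupytext’s API:

```python
# A notebook paired with Jupytext in the py:percent format is just a
# plain Python script: each "# %%" marker starts a new cell.
PAIRED_SCRIPT = """\
# %% [markdown]
# # Data exploration

# %%
import pandas as pd  # lives inside the string; not executed here

# %%
df.describe()
"""

def split_cells(text: str) -> list[str]:
    # Illustrative helper (not Jupytext's API): split the script text
    # on cell markers and drop empty chunks.
    cells = []
    for chunk in text.split("# %%"):
        chunk = chunk.strip()
        if chunk:
            cells.append(chunk)
    return cells

cells = split_cells(PAIRED_SCRIPT)
print(f"{len(cells)} cells")
```

With the real tool, `jupytext --set-formats ipynb,py:percent notebook.ipynb` pairs a notebook with such a script, and `jupytext --sync` keeps the two files in step, so you can commit the diff-friendly .py while still working in the .ipynb.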

About the authors/ODSC Europe 2022 speakers on Shareable Notebooks:

Giulia Bianchi is a senior data scientist at CybelAngel. She is passionate about reproducible data science and has a pragmatic approach. She believes in the importance of sharing knowledge and actively contributing to the community by speaking publicly and writing technical articles.

Mike Tapi Nzali is a machine learning engineer at CybelAngel with a PhD in Computer Science. He likes working in a startup environment and leading the development of machine learning products from idea to production. He is interested in cutting-edge technology, sharing knowledge, and the industrialization of machine learning.