Abstract: At YooMoney, we use ML models extensively for tasks ranging from anti-fraud to NLP.
We started with a Data Scientist who worked in Jupyter, copy-pasted the model code into a Flask application with a pickled model, and zipped it for production. That process was labor-intensive and hardly scalable, so we began to introduce MLOps.
My talk will cover MLOps practices: ways to streamline the model development process and automate it as much as possible. In general, at least some of them are attempts to apply Software Development practices to Machine Learning experimentation and production.
From one point of view, this task was relatively easy for our company: we already have CI/CD in place for regular applications, so why not just use it for ML purposes?
But when it comes to implementation, it turns out not to be such a straightforward process. I will go through the main stages of the MLOps pipeline, explaining the challenges and the solutions we found to overcome them.
The first stage is Model Development. On the one hand, it looks like regular software development (writing code). On the other hand, it requires access to many datasets and DWHs, preferably with live data, which can be challenging in Fintech, as it is with medical data. So we have to solve security issues, for example by introducing IDM interfaces for "sets of datasets".
There are also issues with code-writing tools at this stage: Data Scientists often write code in Jupyter notebooks instead of IDEs such as IDEA, Eclipse, or VS. Notebooks are not directly suitable for creating a standalone application and require additional effort in the merge phase during commits. We solved this with the jupytext module, which syncs both ways between py and ipynb files, with the py files stored in git as the main reference.
The next stage is Preparing for Production. It starts with Model Risk Evaluation: we will briefly mention the probability-impact matrix and name the risks that MLOps can mitigate, such as operational risk or data drift.
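As an illustration of how data drift might be detected automatically, here is a minimal sketch of the Population Stability Index (PSI), one common drift measure. It is not taken from the talk: the sample amounts, the bin edges, and the function names are all hypothetical, and the alert threshold is an assumption that would need tuning for a real model.

```python
import math

def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; eps avoids log(0) when a bin is empty."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of edges the value has reached or passed.
        counts[sum(1 for e in edges if v >= e)] += 1
    return [max(c / len(values), eps) for c in counts]

def psi(reference, live, edges):
    """Population Stability Index between a reference and a live sample."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

# Hypothetical transaction amounts: training-time sample vs. recent traffic.
reference_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
live_amounts = [60, 70, 80, 90, 95, 100, 110, 120]
EDGES = [25, 50, 75]        # assumed bin boundaries
DRIFT_THRESHOLD = 0.25      # assumed alert level, a commonly quoted rule of thumb

drift_detected = psi(reference_amounts, live_amounts, EDGES) > DRIFT_THRESHOLD
```

A check like this could run on a schedule against recent production inputs, so that drift becomes an alert rather than something discovered after business metrics degrade.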
QA (testing practices suitable for ML) also plays an important role in the MLOps process. Since we aim to automate a model's lifecycle, a short testing period helps achieve this. The cheapest and most straightforward solution is a tool like pytest, but it only works until other platforms, such as Scala applications, are introduced, so alternatives like Kotlin autotests should be considered. As for the testing strategy, several approaches can be used (testing on a reference dataset, accessing ground truth, checking business metrics, and so on).
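Testing on a reference dataset can be sketched as a pytest-style test. This is only an illustration under stated assumptions: the `predict` function is a hypothetical stand-in for a trained model, and the dataset and the accuracy threshold are invented for the example.

```python
# Hypothetical stand-in for a trained model: flags a payment as fraud (1)
# when its amount exceeds a fixed limit.
def predict(amount: float) -> int:
    return 1 if amount > 1000.0 else 0

# Hypothetical reference dataset of (amount, expected_label) pairs.
REFERENCE_DATASET = [
    (50.0, 0),
    (999.0, 0),
    (1500.0, 1),
    (20000.0, 1),
    (1200.0, 0),  # a case this toy model gets wrong
]

MIN_ACCURACY = 0.75  # assumed release threshold

def accuracy(dataset) -> float:
    """Fraction of reference examples the model labels correctly."""
    correct = sum(1 for amount, label in dataset if predict(amount) == label)
    return correct / len(dataset)

def test_model_meets_reference_accuracy():
    # pytest collects functions whose names start with "test_".
    assert accuracy(REFERENCE_DATASET) >= MIN_ACCURACY
```

Run in CI, a test like this blocks a release when a retrained model falls below the agreed quality bar on data whose labels are already known.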
After that, we proceed to building the environment for real-time inference. As O'Reilly's "Introducing MLOps" suggests, this part of the process should come as early as possible, before the model is prepared (or even before model development starts). With basic modules like scikit it is a relatively simple process, but with TensorFlow, for example, the situation changes (who has tried simply running the latest version of TensorFlow with the latest version of Python and all the libraries, and how often does it work on the first attempt?). So here we have to solve platform-specific problems (such as whether virtualenv or Docker is the better way to manage dependencies) and more general ones, such as which tools to use for managing the inference platform. Should it be a single machine, a Kubernetes cluster, or something else? Our current solution is horizontal scaling behind a load balancer, and we are aiming to use Docker under Kubernetes as the target platform.
Deployment also involves many decisions. We start with building artifacts for deployment. What should come with a model: just a serialized object, or a reproducible research set including data? We started with pickle / hdf5, and they suited us perfectly until Scala models were introduced. Now we have to switch to another technology, choosing among formats such as PMML, PFA, ONNX, or POJO (the main pros and cons will be discussed during the presentation).
Release (which is not the same as deployment) comes next. At this stage, we start using the new model. Some of the risks discussed above can be mitigated here: for example, operational risk can be handled with canary releases or blue-green releases.
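The idea behind a canary release can be sketched as deterministic traffic splitting: a small, fixed share of requests goes to the new model while the rest stays on the current one. The sketch below is an assumption-laden illustration, not the routing used in production; the model labels, the 5% share, and the function names are hypothetical.

```python
import hashlib

CANARY_PERCENT = 5  # assumed share of traffic routed to the new model

def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def choose_model(request_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Return which model version should serve this request."""
    # Hashing keeps the decision stable: the same id always gets the same model.
    return "candidate" if bucket(request_id) < canary_percent else "stable"

# Example: route a batch of hypothetical request ids.
routes = [choose_model(f"req-{i}") for i in range(10_000)]
canary_share = routes.count("candidate") / len(routes)
```

Raising `canary_percent` step by step widens the rollout, and setting it back to 0 returns all traffic to the stable model without a redeploy.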
Last but not least, we can combine several of the techniques discussed above in Monitoring. Here we need a Model Repository or some other way of identifying the model version (writing the version to a log works well too). Testing practices can also be used to ensure that the model performs well in the production environment (from heartbeat/ping requests to human evaluation of a portion of requests).
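Writing the model version to a log can be as simple as attaching it to every prediction record. The following is a minimal sketch under stated assumptions: the version string, the logger name, the record fields, and the placeholder `predict` function are all hypothetical.

```python
import json
import logging

MODEL_VERSION = "antifraud-1.4.2"  # hypothetical version identifier

logger = logging.getLogger("inference")  # assumed logger name

def predict(features: dict) -> float:
    # Placeholder scoring logic standing in for a real model.
    return 0.9 if features.get("amount", 0) > 1000 else 0.1

def predict_and_log(request_id: str, features: dict) -> dict:
    """Score a request and log the result together with the model version."""
    record = {
        "request_id": request_id,
        "model_version": MODEL_VERSION,
        "score": predict(features),
    }
    # One JSON object per line is easy to parse in log pipelines.
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

With the version in every record, a change in score distribution or error rate can be traced to the exact model release that produced it.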
The main idea of MLOps is similar to DevOps: identify the labor-intensive parts of the model lifecycle, choose the ones that can be automated, and match appropriate tools to them (examples were given above). This approach makes model development more predictable and ensures that highly qualified people, such as Data Scientists or Subject Matter Experts, can focus on their specific tasks, leaving almost all infrastructure-related work to automated tools.
Bio: Evgenii is the Head of the Data Engineering and Data Science team at YooMoney, the leading payment service provider in the CIS market. Evgenii and his team have completed a wide range of projects, including an accounting system based on blockchain technologies (as an analyst), a BRE+ML-based antifraud engine (as an architect and project manager), Business Intelligence solutions (as a developer, analyst, architect, and project manager), and many others. Currently, Evgenii participates in ML projects as an ML Architect and Project Manager.