As the world is changing rapidly around us, it is often questionable whether something we learned from the past is still valid. Machine learning models that make predictions of the future based on past data points are probably under most scrutiny from businesses in the current climate. Close monitoring of these models in production is crucial to weathering the storm at this time.

At Zopa, we strive to provide the best consumer finance products in the UK market, including personal loans and credit cards. As a pioneer of the UK’s fintech industry, we have been at the forefront of innovation since 2005. Machine learning models have been an important technological advantage for our business, serving our customers and driving business decisions through credit risk evaluation, affordability assessment, customer engagement, and operational optimization. Through years of experience developing and operating machine learning models, we have established a systematic approach to model monitoring, which generalizes into the following four questions:

  1. Is the model functioning as intended?
  2. Is the model still being applied as designed?
  3. Has the learned relationship changed?
  4. Is there any new signal emerging?

In the following blog, we’ll look at the answers to these questions, taking a binary probabilistic classification model (e.g. whether the customer is likely to stop repaying their loan) as an example. In this blog (part one), we’ll be focusing on questions 1 and 2.

It is worth mentioning here that we are only discussing “statistical” monitoring; monitoring the software service that hosts the model is a completely different beast.

  1. Is the model functioning as intended?

Typically, we carry out an end-to-end acceptance test before a model is deployed to production. This checks that, on a large sample (typically the model development dataset), the implementation achieves results identical to those from the model development setup. This would highlight any discrepancy in the data source, feature processing, model configuration, post-model transformation, or software environment.
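As a sketch, such an acceptance test can be as simple as comparing the score vectors produced by the two environments element by element. The helper name and tolerance below are illustrative, not Zopa's actual tooling:

```python
import numpy as np

def acceptance_test(dev_scores, prod_scores, tol=1e-9):
    """Check that production scores match development scores element-wise.

    Returns (passed, max_abs_difference).
    """
    dev_scores = np.asarray(dev_scores, dtype=float)
    prod_scores = np.asarray(prod_scores, dtype=float)
    assert dev_scores.shape == prod_scores.shape, "score vectors differ in length"
    diffs = np.abs(dev_scores - prod_scores)
    return bool(np.all(diffs <= tol)), float(diffs.max())

# A faithful re-implementation should agree to numerical precision
ok, max_diff = acceptance_test([0.12, 0.87, 0.45], [0.12, 0.87, 0.45])
```

In practice the tolerance depends on the stack: an exact-match requirement is feasible when both environments share the same libraries, while cross-language ports may need a small numerical tolerance.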

After deployment, we then convert the development setup into a “dual control” monitor. For every invocation of the model in production, this setup extracts the input variables from (ideally independent) data sources and tries to reproduce identical outcomes in an offline environment. This allows us to detect edge cases that are not present in the acceptance test. It would also flag up any unintended changes in the data provider services or the software infrastructure.

  2. Is the model still being applied as designed?

A machine learning model can only learn patterns from its training dataset. It is thus crucial to construct a training dataset that is as representative as possible of the target population in an application. However, the target population might change over time post-deployment, as market conditions evolve and businesses explore new opportunities. This can lead to two types of problems:

Unseen sub-population emerging

When the model is applied to data points not represented in the training dataset, the predictions are either restricted to the edge cases of the training sample (e.g. tree-based algorithms) or extrapolated without constraint (e.g. logistic regression, neural networks). Given the nonlinear nature of most machine learning models, this often leads to predictions far from the truth.

An obvious solution is to set up monitoring that explicitly detects outliers in the “application” dataset: data points with at least one input variable outside the range represented in the training data. However, as machine learning models typically deal with multiple input variables, it is much harder to notice an outlier that is within range on every individual dimension but represents an unseen combination.
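The simple per-variable range check can be sketched as follows (the data and helper name are made up for illustration):

```python
import numpy as np

def univariate_outliers(X_train, X_apply):
    """Flag application rows with any feature outside the training range."""
    X_train = np.asarray(X_train, dtype=float)
    X_apply = np.asarray(X_apply, dtype=float)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.any((X_apply < lo) | (X_apply > hi), axis=1)

# Columns: income, monthly mortgage payment (synthetic numbers)
X_train = np.array([[20_000, 400], [60_000, 1_200], [40_000, 800]])
X_apply = np.array([[30_000, 600],     # within all training ranges
                    [80_000, 1_000]])  # income above training maximum
mask = univariate_outliers(X_train, X_apply)  # → [False, True]
```

This catches the first type of outlier cheaply, but, as noted above, it is blind to unseen combinations of in-range values.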

For example, assume a credit risk model with two input variables: income and monthly mortgage payments. As these two variables are typically correlated due to mortgage providers’ lending policies, the training dataset might contain two types of customers: a lower-risk group with good income and high monthly mortgage payments, and a riskier group with lower values in both. With the pandemic hitting the economy hard, the model might start encountering a growing population of customers with high monthly mortgage payments but reduced income, due to pay cuts or job losses. While the value of each variable is within a seen range, the combination was rare in the training dataset. The model might naively predict an intermediate risk profile for this new sub-population, as one input variable is low while the other is high, when in reality this group is likely under more financial stress than all the others.

To detect such hidden outliers in the multi-dimensional space, we set up monitoring using a k-nearest-neighbor (kNN) classifier. Like most machine learning binary classifiers, this algorithm can estimate a probability between two classes based on multi-dimensional input, yet it is more focused on estimation using local information. We train this model on a mixture of the model training dataset and the application dataset, with the former labelled as 1 and the latter as 0. We can then use this classifier to estimate, for each data point in the application dataset, the probability that it is represented in the training dataset. If a few data points have a very low estimated probability of being represented, chances are that we are looking at a new sub-population.
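A minimal sketch of this setup with scikit-learn, using synthetic data in place of real credit features (the neighbor count and flagging threshold are illustrative tuning choices, not prescribed values):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic training set with two correlated features (e.g. income vs mortgage)
X_train = rng.normal(size=(1000, 2))
X_train[:, 1] = 0.9 * X_train[:, 0] + 0.3 * X_train[:, 1]
# Synthetic application set containing a decorrelated sub-population
X_apply = rng.normal(size=(200, 2))

# Label training rows 1 and application rows 0, then fit a kNN classifier
X = np.vstack([X_train, X_apply])
y = np.concatenate([np.ones(len(X_train)), np.zeros(len(X_apply))])
X_scaled = StandardScaler().fit_transform(X)  # kNN is distance-based, so scale

knn = KNeighborsClassifier(n_neighbors=25).fit(X_scaled, y)

# Probability that each application row is "represented" in the training data
# (each point counts itself among its neighbors; a small bias that is
# acceptable for monitoring purposes)
p_represented = knn.predict_proba(X_scaled[len(X_train):])[:, 1]
hidden_outliers = p_represented < 0.2
```

Application rows falling in regions dense with training points get probabilities near 1; rows in unseen combinations of otherwise in-range values get probabilities near 0 and are flagged for review.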

Change in population composition

Depending on the chosen algorithm and training sample, a machine learning model might perform differently across sub-populations. Sometimes this is by design, e.g. we might want our model to be optimized for the higher-credit-utilization sub-population. Other times it is the implicit result of the bias/variance trade-off during model development, often driven by the composition of the training dataset.

As market conditions change, the model might be applied more frequently in regions it is less optimized for, e.g. when customers shift towards lower credit utilization due to reduced spending during the lockdown. Alternatively, the business might scale back its risk appetite during this turbulent time, which in turn reduces the variance of the customer profiles that the model learned to differentiate.

Although such population change does not intrinsically make the model less valid than when it was first developed, it often leads to a decrease in model performance metrics such as ROC AUC or accuracy. To see why, imagine the extreme case where we select a specific sub-sample of the training dataset whose model scores fall within an arbitrarily narrow band. By construction, the model can tell very little difference between individuals within this sample of reduced scope (i.e. very low ROC AUC), despite being optimized to differentiate them across the broader spectrum. In real life, the narrowing of the population composition is likely less drastic, yet the impact on model performance metrics can still be quite significant.
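This effect is easy to reproduce in a simulation: a perfectly calibrated score keeps a high ROC AUC on the full population, but the AUC drops towards 0.5 once the evaluation is restricted to a narrow score band (all numbers below are synthetic):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 20_000
score = rng.uniform(0, 1, n)            # a perfectly calibrated model score
label = rng.uniform(0, 1, n) < score    # outcome drawn with P(label=1) = score

auc_full = roc_auc_score(label, score)  # high on the full population (~0.83)

# Restrict evaluation to a narrow score band, mimicking a narrowed population
band = (score > 0.45) & (score < 0.55)
auc_band = roc_auc_score(label[band], score[band])  # close to 0.5
```

The model has not changed at all between the two measurements; only the population it is evaluated on has narrowed.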

Naturally, we keep a close watch on the univariate distribution of each input variable to detect any significant drift. For the reasons described above, we can also monitor the distribution of the kNN probability between the training and application datasets to detect multi-dimensional shifts.

More interestingly, we can further use the kNN’s estimates to reweight the training dataset, by converting the probabilistic prediction into the odds of the application dataset over the training dataset. Applying these weights in the computation of metrics (e.g. the sample_weight argument of sklearn.metrics.roc_auc_score), we can estimate the impact of such distributional changes on model performance using the labels of the training dataset. Thus, we can adjust our models promptly, without waiting months until the actual outcomes are observed. In addition, these weights can be used to retrain the model on the same training dataset, where the algorithm might shift its focus of optimization accordingly.
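A sketch of the reweighting step, assuming `p_app` holds the kNN's estimated probability that each *training* row belongs to the application population (all data below are synthetic, and the helper name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def knn_odds_weights(p_application, eps=1e-6):
    """Convert P(application | x) into application-over-training odds."""
    p = np.clip(p_application, eps, 1 - eps)  # guard against division by zero
    return p / (1 - p)

rng = np.random.default_rng(7)
# Synthetic training labels and model scores (calibrated score)
score_train = rng.uniform(0, 1, 10_000)
y_train = rng.uniform(0, 1, 10_000) < score_train
# Synthetic kNN output: pretend the application population skews high-score
p_app = np.clip(0.3 + 0.4 * score_train + rng.normal(0, 0.05, 10_000), 0, 1)

weights = knn_odds_weights(p_app)
auc_unweighted = roc_auc_score(y_train, score_train)
auc_reweighted = roc_auc_score(y_train, score_train, sample_weight=weights)
```

The gap between the two AUC figures is an early estimate of how much the distributional shift alone is costing the model, available months before the actual outcomes of the application population are observed.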

Next on machine learning model monitoring

This is the first of my two blog posts on machine learning model monitoring. We looked at how to detect and mitigate functionality degradation and changes in applicability. In the next blog (part 2), I will discuss how to detect drift in learned relationships, as well as emerging new signals. I will also discuss these topics with more detailed examples at ODSC Europe in September. Hope to meet you there (virtually).

Editor’s note: Dr. Jiahang Zhong is a speaker for ODSC Europe 2020. Check out his talk, “Can Your Model Survive the Crisis: Monitoring, Diagnosis and Mitigation,” there! In his session, he will share some experience of model monitoring and diagnosis from a leading UK fintech company.

About the author/ODSC Europe speaker:

Dr. Jiahang Zhong is the leader of the data science team at Zopa, one of the UK’s earliest fintech companies. He has broad experience in data science projects in credit risk, operational optimization, and marketing, with keen interests in machine learning, optimization algorithms, and big data technologies. Prior to Zopa, he worked as a PhD and Postdoctoral researcher on the Large Hadron Collider project at CERN, with a focus on data analysis, statistics, and distributed computing.