Abstract: Tuning a model is a core part of a data scientist's work, and it is often difficult, requiring both experience and expertise to do effectively. Feature selection is an important and integral part of the tuning process, because in many cases the model itself is a 'black box', which makes it hard to understand how individual features perform.
In one case, we had a working model and decided to update our test data. That is when everything fell apart: model accuracy dropped instantly, and we could not explain why. To fix the problem, we first inspected the misclassifications, but saw nothing remarkable; they all looked different from one another. Next, we calculated the feature contributions of the model's predictions and analyzed them. This process led us to the trouble-making features. We'll show how we did it by walking through our experience of tuning a model using analytics on feature contribution data.
Data scientists choose the features sent to the model: they can add, remove, or change them. Each feature typically affects the model's performance both positively and negatively, depending on the input. As data scientists, we would like to know whether a feature is good or bad and how its behavior changes across different test sets.
Feature contribution data, also referred to as explainability data, can be calculated in various ways; for example, with the SHAP or LIME Python libraries, both of which are model-agnostic.
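As a toy illustration of what contribution data looks like (not the actual SHAP or LIME algorithms), here is a leave-one-feature-out sketch on a simple weighted-sum "model"; all names and numbers below are hypothetical:

```python
# Toy leave-one-feature-out contributions for a linear scoring function.
# This only illustrates the *shape* of contribution data: one float per
# feature per prediction. In practice you would use shap or lime.

def predict(x, weights):
    """A stand-in 'model': a simple weighted sum of the features."""
    return sum(w * v for w, v in zip(weights, x))

def contributions(x, weights, baseline):
    """Per-feature contribution: how much the prediction changes when
    the feature is replaced by its baseline value."""
    full = predict(x, weights)
    result = []
    for i in range(len(x)):
        masked = list(x)
        masked[i] = baseline[i]
        result.append(full - predict(masked, weights))
    return result

weights = [0.5, -2.0, 1.0]
x = [4.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
print(contributions(x, weights, baseline))  # [2.0, -2.0, 3.0]
```

Model-agnostic libraries do something far more careful (SHAP, for instance, averages over feature coalitions), but the output has this same simple form: an array of floats per prediction.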
The process we suggest is simple and consists of two main steps:
1. Calculate and keep the feature contribution data for different experiments and test sets
2. Analyze the data using a query engine and analytics tools
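A minimal sketch of these two steps, assuming an in-memory SQLite database as the query engine (the schema, experiment names, and feature name are hypothetical):

```python
import sqlite3

# Step 1: keep per-prediction feature contributions, keyed by
# experiment and test set. The schema is a hypothetical sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contributions (
    experiment TEXT, test_set TEXT, feature TEXT, value REAL)""")

rows = [  # made-up contribution values for one feature
    ("exp1", "old_test", "url_length", 0.30),
    ("exp1", "old_test", "url_length", 0.25),
    ("exp1", "new_test", "url_length", -0.40),
    ("exp1", "new_test", "url_length", -0.35),
]
conn.executemany("INSERT INTO contributions VALUES (?, ?, ?, ?)", rows)

# Step 2: analyze with plain SQL, e.g. the average contribution of a
# feature in each test set.
query = """SELECT test_set, AVG(value) FROM contributions
           WHERE feature = 'url_length'
           GROUP BY test_set ORDER BY test_set"""
for test_set, avg in conn.execute(query):
    print(f"{test_set}: {avg:+.3f}")
```

Any query engine works here; the point is that once contributions are stored as plain rows of floats, standard analytics tooling applies to them directly.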
The advantage of this approach is that it makes the data analyzable. Unlike the feature vector input, feature contribution data is simple: arrays of floating-point numbers. For example, you can calculate the average contribution of a feature over a test set to decide whether it is helpful. You can work with large amounts of data and choose whichever scoring method suits you.
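For instance, a sketch of the average-contribution check described above, on made-up contribution arrays for a single feature across two test sets:

```python
# Compare one feature's average contribution across two test sets.
# The contribution values below are made-up illustrative numbers.

def mean(values):
    return sum(values) / len(values)

# One contribution value per prediction, for a single feature.
feature_contrib = {
    "test_set_a": [0.20, 0.15, 0.25, 0.10],
    "test_set_b": [-0.30, -0.45, -0.20, -0.25],
}

avg_a = mean(feature_contrib["test_set_a"])
avg_b = mean(feature_contrib["test_set_b"])

# A large shift between test sets flags the feature for review.
print(f"test_set_a: {avg_a:+.3f}")
print(f"test_set_b: {avg_b:+.3f}")
print(f"shift: {abs(avg_a - avg_b):.3f}")
```

Here the feature's average contribution flips sign between the two test sets, which is exactly the kind of signal that would have pointed us to the trouble-making features early.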
We believe that knowing more about feature contributions helps you learn about both your model and your data, which leads to better results that are easier and faster to achieve. We will explain how feature contributions are calculated, show examples, and share our experience using them in a model tuning process.
Bio: Coming soon!
Principal Engineer, Threat Research | Imperva