The suitability of a machine learning model is traditionally measured by its accuracy. A model that scores well on metrics like RMSE, MAPE, AUC, ROC, or Gini is considered high performing. While such accuracy metrics are important, are there other metrics the data science community has been ignoring so far? The answer is yes: in the pursuit of accuracy, most models sacrifice “fairness” and “interpretability.” Rarely does a data scientist dissect a model to find out whether it follows all ethical norms. This is where machine learning fairness and the interpretability of models come in.
[Related Article: AI Ethics: Avoiding Our Big Questions]
There have been multiple instances of an ML model discriminating against a particular section of society, be it rejecting female candidates during hiring, systematically disapproving loans for working women, or rejecting darker-skinned candidates at a higher rate. Recently, open-source facial recognition algorithms were found to have lower accuracy on darker-skinned female faces than on lighter-skinned male faces. In another instance, research by CMU showed how Google's ad system showed ads for high-income jobs to men more often than to women.
Certain people belong to protected categories. For instance, if a business discriminates against a person solely because they are a person of color, that would be considered unethical and illegal. However, some ML models in banks today do exactly that, by including a feature encoding the race of each applicant. This is against the concept of fairness.
Machine learning, as the name implies, learns whatever it is taught; it is a ramification of what it is fed. It is a fallacy that ML has no perspective: it has exactly the perspective of the data used to train it. In simple words, algorithms can echo the prejudices that data explicitly or implicitly carries.
It’s important for an organization to ensure its models are fair and accountable. The first step towards this is to understand the distribution of sensitive features (like age, gender, color, race, nationality) against the outcome features (default, reject, approve, high rate, etc.).
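As a minimal sketch of that first step (the DataFrame and the column names `gender` and `default` are illustrative, not from the actual credit data), a simple group-by shows the outcome rate per sensitive group:

```python
import pandas as pd

# Toy loan data; the column names "gender" and "default" are hypothetical.
df = pd.DataFrame({
    "gender":  ["F", "F", "M", "M", "F", "M", "F", "M"],
    "default": [1, 0, 0, 0, 1, 0, 1, 0],
})

# Outcome rate within each sensitive group; a large gap warrants scrutiny.
rates = df.groupby("gender")["default"].mean()
print(rates)
```

A large gap between groups does not prove discrimination on its own, but it tells you which features deserve a formal fairness audit.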
In order to ensure the fairness of models, some key metrics need to be defined. While there are many possible fairness metrics, the most important are Statistical Parity, Mean Difference, and Disparate Impact, which can be used to quantify and measure bias or discrimination.
For instance, a metric like Statistical Parity reveals whether the data in question discriminates against an unprivileged class for a favorable outcome.
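These metrics can be computed directly from the outcomes and a privileged-group indicator. The sketch below is a minimal implementation (for a binary favorable outcome, the mean difference coincides with the statistical parity difference); the data is made up for illustration:

```python
import numpy as np

def fairness_metrics(y, privileged):
    """y: 1 = favorable outcome; privileged: boolean mask for the privileged group."""
    y = np.asarray(y, dtype=float)
    priv = np.asarray(privileged, dtype=bool)
    p_priv, p_unpriv = y[priv].mean(), y[~priv].mean()
    return {
        # Statistical parity (mean) difference: 0 means parity,
        # negative values mean the unprivileged group is worse off.
        "statistical_parity_difference": p_unpriv - p_priv,
        # Disparate impact ratio: 1 means parity; values below 0.8
        # are a common red flag (the "four-fifths rule").
        "disparate_impact": p_unpriv / p_priv,
    }

# Favorable outcome (loan approved = 1) is skewed toward the privileged group.
m = fairness_metrics(y=[1, 1, 1, 0, 1, 0, 0, 0],
                     privileged=[1, 1, 1, 1, 0, 0, 0, 0])
```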
Using credit risk data where we wanted to predict the probability of someone defaulting on a loan, we were able to shortlist features that were discriminatory in nature. In this data (risk calculated for over 61,000 customers across 200 features), around 6 to 8 sensitive features (married or otherwise, single or otherwise, home owner, mortgage, age group, ethnicity, primary language English or otherwise, etc.) scream discrimination. For illustration, we chose a discriminatory feature indicating a candidate's marital status, and by applying a bias removal technique, reweighing, to this feature, we were able to reduce bias from 0.13 to 1.1e-16. Passing the fairness-induced sample weights generated by reweighing to a logistic regression, we saw the accuracy difference between married and otherwise candidates shrink by 0.60%, while overall accuracy increased by 0.13 points. The point to note here is that with these sample weights, the model accuracy difference (or discrimination) between the two groups of a sensitive class decreased significantly, ensuring similar performance for both groups.
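The reweighing idea can be sketched in a few lines. This follows the standard Kamiran and Calders formulation (each sample gets weight P(group) x P(outcome) / P(group, outcome), so that group and outcome become statistically independent under the weights); the arrays here are toy data, not the credit data above:

```python
import numpy as np

def reweighing_weights(group, y):
    """Kamiran & Calders reweighing:
    w(g, c) = P(group = g) * P(y = c) / P(group = g, y = c)."""
    group, y = np.asarray(group), np.asarray(y)
    w = np.empty(len(y), dtype=float)
    for g in np.unique(group):
        for c in np.unique(y):
            cell = (group == g) & (y == c)
            w[cell] = (group == g).mean() * (y == c).mean() / cell.mean()
    return w

# Toy example: group 1 is over-represented among favorable outcomes (y = 1).
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y     = np.array([1, 1, 1, 0, 1, 0, 0, 0])
w = reweighing_weights(group, y)
# After weighting, the favorable-outcome rate is equal across groups; w can
# then be passed to a classifier, e.g. sklearn's
# LogisticRegression().fit(X, y, sample_weight=w).
```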
Diving deeper, the AUC difference (between the fair algorithm and the baseline algorithm), Gini difference, precision difference, sensitivity difference (a.k.a. true positive rate or recall), FNR difference, and F1 score difference either stayed the same or improved by a few points. Bringing this to a monetary perspective (considering a penalty of 700 for a false positive and 300 for a false negative), there was a significant drop in cost after introducing fairness, which is a win-win situation.
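The cost calculation behind that comparison is straightforward; a minimal sketch using the penalty values cited above (700 per false positive, 300 per false negative) might look like:

```python
import numpy as np

def misclassification_cost(y_true, y_pred, fp_cost=700, fn_cost=300):
    """Total monetary cost: fp_cost per false positive, fn_cost per false negative."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return fp * fp_cost + fn * fn_cost

# One false positive and one false negative in this toy prediction.
cost = misclassification_cost([1, 0, 1, 0], [1, 1, 0, 0])
```

Comparing this cost for the baseline and the fairness-weighted model gives the monetary view described above.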
The difference in fairness metrics before and after introducing fairness was also quite promising: equality of opportunity, equalized odds, and demographic parity all moved in the direction of a fairer model. On almost all of these metrics, as well as the accuracy metrics above, the model trained with fairness weights performed better than the baseline.
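As a hedged sketch of how those three criteria can be measured (there are several conventions for reporting equalized odds; here the larger of the TPR and FPR gaps is reported, and the data is illustrative):

```python
import numpy as np

def fairness_gaps(y_true, y_pred, privileged):
    """Group gaps behind three common fairness criteria (0 = perfectly fair)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    priv = np.asarray(privileged, dtype=bool)

    def tpr_fpr(mask):
        yt, yp = y_true[mask], y_pred[mask]
        return yp[yt == 1].mean(), yp[yt == 0].mean()

    tpr_p, fpr_p = tpr_fpr(priv)
    tpr_u, fpr_u = tpr_fpr(~priv)
    return {
        # Equality of opportunity: equal true positive rates across groups.
        "equal_opportunity_diff": tpr_u - tpr_p,
        # Equalized odds: equal TPR and FPR; report the larger gap.
        "equalized_odds_diff": max(abs(tpr_u - tpr_p), abs(fpr_u - fpr_p)),
        # Demographic parity: equal positive-prediction rates.
        "demographic_parity_diff": y_pred[~priv].mean() - y_pred[priv].mean(),
    }

gaps = fairness_gaps(y_true=[1, 1, 0, 0, 1, 1, 0, 0],
                     y_pred=[1, 0, 1, 0, 1, 1, 0, 0],
                     privileged=[1, 1, 1, 1, 0, 0, 0, 0])
```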
If we opt out of the reweighing method, or use an algorithm that does not accept sample weights as a model parameter, we can instead work on the final predictions to make the outcome fair, calibrating the prediction probability threshold for optimal results using various techniques.
[Related Article: 9 Common Mistakes That Lead To Data Bias]
In the first instance, we decided to use the same threshold for married and otherwise candidates to find the optimal result. To optimize further, we checked model performance (accuracy, fairness metrics, and cost) across various thresholds and found that thresholds between .50 and .80 gave optimal results.
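A threshold sweep like the one described can be sketched as follows; the scores and labels here are made up, and in practice the scores would come from the trained model:

```python
import numpy as np

def sweep_threshold(y_true, scores, thresholds):
    """Accuracy and cost (700 per FP, 300 per FN) at each shared threshold."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    out = []
    for t in thresholds:
        y_pred = (scores >= t).astype(int)
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        out.append({"threshold": round(float(t), 2),
                    "accuracy": float((y_pred == y_true).mean()),
                    "cost": 700 * fp + 300 * fn})
    return out

results = sweep_threshold(y_true=[1, 0, 1, 0],
                          scores=[0.90, 0.20, 0.70, 0.40],
                          thresholds=np.arange(0.30, 0.85, 0.05))
# Pick the cheapest threshold, breaking ties by accuracy.
best = min(results, key=lambda r: (r["cost"], -r["accuracy"]))
```

The same loop can report the fairness gaps at each threshold, so the final choice balances accuracy, cost, and fairness together.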
In another scenario, we had the flexibility to choose different thresholds for married and otherwise candidates in order to bring in fairness and get optimal results in terms of accuracy and cost. Here we see how the threshold for one class can be around .40 to .50, while a threshold between .30 and .40 is acceptable for the other group. The plot shows the movement of the two group-specific thresholds across various fairness metrics and cost.
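Applying group-specific thresholds is a small extension of the single-threshold case. A minimal sketch (the scores, group labels, and the particular cutoffs 0.45 and 0.35 are illustrative, chosen from within the ranges mentioned above):

```python
import numpy as np

def predict_with_group_thresholds(scores, privileged, t_priv, t_unpriv):
    """Apply a different decision threshold to each group of a sensitive class."""
    scores = np.asarray(scores)
    priv = np.asarray(privileged, dtype=bool)
    return np.where(priv, scores >= t_priv, scores >= t_unpriv).astype(int)

# Privileged group cut at 0.45, unprivileged at 0.35.
y_pred = predict_with_group_thresholds(scores=[0.50, 0.40, 0.40, 0.30],
                                       privileged=[1, 1, 0, 0],
                                       t_priv=0.45, t_unpriv=0.35)
```

Sweeping both thresholds over a grid and scoring each pair on accuracy, cost, and the fairness gaps reproduces the kind of comparison the plot describes.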
The three methods discussed can be used to remove bias and optimize accuracy/cost for a single feature or a series of features. Given that proxy variables exist (gender can be derived from income, race from zip code, marital status from purchasing patterns), a single method can tackle bias in multiple features. A data scientist may also want to create a Boolean feature describing multiple conditions (for instance, male & young and female & young are privileged while others are unprivileged) and use any of these techniques for bias detection and removal, promoting machine learning fairness.
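Constructing such a combined Boolean feature is a one-liner in pandas; the column names and the privilege definition below simply mirror the example in the text and are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":    ["M", "M", "F", "F"],
    "age_group": ["young", "old", "young", "old"],
})

# Privileged = (male & young) or (female & young), per the example above.
young = df["age_group"] == "young"
male, female = df["gender"] == "M", df["gender"] == "F"
df["privileged"] = ((male & young) | (female & young)).astype(int)
```

The resulting `privileged` column can then be fed to any of the metrics or mitigation techniques above as the sensitive-group indicator.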