Abstract: As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner has become a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that assign each feature in a model’s input a score reflecting that feature’s influence on the model’s output. A major practical limitation of this family of explainers is that they can disagree about which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by including in the loss function, alongside the standard term corresponding to model performance, an additional term that measures the difference in feature attributions between a pair of explainers. In experiments on three datasets, we observe that models trained with this loss term can show improved explanation consensus on unseen data, and even across explainers that were not explicitly trained to agree. However, this improved consensus comes at a cost to model performance. Finally, we study how our method influences model outputs and explanations.
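To make the idea concrete, here is a minimal sketch of what such a consensus-regularized loss could look like. It uses logistic regression and two simple attribution methods (the plain gradient of the logit, and input-times-gradient) purely as stand-ins; the explainer pair, normalization scheme, and function names are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch: a loss with a performance term (binary cross-entropy)
# plus a penalty on disagreement between two feature-attribution explainers.
import numpy as np

def attributions(w, x):
    """Two simple post hoc attributions for a linear logit w . x:
    the gradient explainer (d(logit)/dx = w, same for every sample)
    and input-times-gradient (w * x, which varies per sample)."""
    grad = np.broadcast_to(w, x.shape)
    inp_x_grad = x * w
    return grad, inp_x_grad

def normalize(a):
    """L1-normalize absolute attributions per sample so the two
    explainers are compared on a common scale."""
    a = np.abs(a)
    return a / (a.sum(axis=1, keepdims=True) + 1e-12)

def consensus_loss(w, x, y, lam=0.1):
    """Standard loss plus lam * mean squared attribution disagreement."""
    logits = x @ w
    p = 1.0 / (1.0 + np.exp(-logits))
    bce = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    g, ixg = attributions(w, x)
    disagreement = np.mean((normalize(g) - normalize(ixg)) ** 2)
    return bce + lam * disagreement, bce, disagreement
```

Minimizing this objective trades off predictive performance against explainer agreement, with `lam` controlling how strongly consensus is enforced.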
Bio: Max Cembalest is a researcher at Arthur focused on simplifying and explaining machine learning models. Previously, he received an M.S. in Data Science from Harvard University, where he concentrated on interpretability and graph-based models. He is particularly excited about recent advances in applying abstract algebra, topology, and category theory to neural network design.
Machine Learning Engineer | Arthur