Abstract: As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner has become a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods that assign each feature in a model's input a score corresponding to that feature's influence on the model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by including in the loss function, alongside the standard term corresponding to model performance, an additional term that measures the difference in feature attribution between a pair of explainers. In our experiments on three datasets, we observe that models trained with this loss term achieve improved explanation consensus on unseen data, and even on explainers that were not explicitly trained to agree. However, this improved consensus comes at a cost to model performance. Finally, we study how our method influences model outputs and explanations.
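The abstract's core idea, a task loss plus a penalty on the disagreement between two explainers' attributions, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a logistic-regression model and uses plain gradient (which for a linear model is just `w`) and input-times-gradient (`x * w`) as the hypothetical explainer pair, with a made-up penalty weight `lam`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lam, lr=0.1, steps=500):
    """Train logistic regression with loss = BCE + lam * consensus penalty.

    The consensus penalty is the mean squared difference between the
    gradient attribution (w_j) and the input-times-gradient attribution
    (x_ij * w_j), averaged over samples i and features j.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        # Gradient of the binary cross-entropy term w.r.t. w.
        grad_task = X.T @ (p - y) / n
        # Gradient of the penalty C = mean_{i,j} (w_j - x_ij * w_j)^2,
        # i.e. dC/dw_j = (2 * w_j / d) * mean_i (1 - x_ij)^2.
        grad_cons = 2.0 * w * ((1.0 - X) ** 2).mean(axis=0) / d
        w -= lr * (grad_task + lam * grad_cons)
    return w

def disagreement(X, w):
    """Mean squared gap between the two attributions under this model."""
    return float((((1.0 - X) * w) ** 2).mean())
```

Training with `lam > 0` directly shrinks the attribution gap measured by `disagreement`, at the cost of pulling the weights away from the task-optimal solution, mirroring the consensus-versus-performance trade-off described in the abstract.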
Bio: Avi Schwarzschild is a research fellow at Arthur and a fifth-year PhD student in the Applied Math and Scientific Computation program at the University of Maryland. His work at Arthur focuses on explainability tools for neural networks. At the University of Maryland, he is advised by Tom Goldstein on his work in deep learning. His interests range from security to generalization and interpretability, and he aims to expand our understanding of when and why neural networks work.