Abstract: We propose a method to assess the sensitivity of data analyses to the removal of a small fraction of the data set. Our metric, which we call the Approximate Maximum Influence Perturbation, approximately computes the fraction of observations with the greatest influence on a given result when dropped. The exact computation is prohibitive; for example, even with a mere 100 data points and a 1-second data analysis, dropping every subset of 5 data points and re-running each analysis would take years. By contrast, our approximation runs in seconds for real data sets that are orders of magnitude larger. Our approximation is automatically computable and works for common estimators (including, but not limited to, ordinary least squares, instrumental variables, generalized method of moments, maximum likelihood, variational Bayes, and other estimators that minimize loss). At minimal additional computational cost, we provide an exact lower bound on sensitivity, so any non-robustness our metric finds is conclusive. We demonstrate that the Approximate Maximum Influence Perturbation is driven by a low signal-to-noise ratio in the data analysis, is not reflected in standard errors, does not disappear asymptotically, and is not a product of misspecification. We demonstrate our metric on influential applications in econometrics. While we find some applications are robust, in others the principal conclusion can be changed by dropping less than 1% of the sample even when standard errors are small. In one case, we identify a single data point out of over 16,500 that changes the conclusions of a data analysis.
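The cost of the exact computation quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `exhaustive_cost` and its parameters are hypothetical, and it assumes each re-run takes exactly 1 second and that only subsets of exactly 5 points are dropped.

```python
import math

# Assumption: a Julian year of 365.25 days; the conclusion does not depend on this choice.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_cost(n_points, n_dropped, seconds_per_fit=1.0):
    """Return (number of subsets, total years) for re-running an analysis
    once for every way of dropping `n_dropped` of `n_points` observations."""
    n_subsets = math.comb(n_points, n_dropped)
    years = n_subsets * seconds_per_fit / SECONDS_PER_YEAR
    return n_subsets, years


if __name__ == "__main__":
    n_subsets, years = exhaustive_cost(100, 5)
    print(f"{n_subsets:,} subsets, about {years:.1f} years of compute")
```

With 100 data points and 5 dropped, this gives 75,287,520 re-runs, or roughly 2.4 years at one second each, which matches the abstract's statement that the exact computation "would take years".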
Bio: Tamara Broderick is an Associate Professor in the Department of Electrical Engineering and Computer Science at MIT. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). She completed her Ph.D. in Statistics at the University of California, Berkeley in 2014. Previously, she received an AB in Mathematics from Princeton University (2007), a Master of Advanced Study for completion of Part III of the Mathematical Tripos from the University of Cambridge (2008), an MPhil by research in Physics from the University of Cambridge (2009), and an MS in Computer Science from the University of California, Berkeley (2013). Her recent research has focused on developing and analyzing models for scalable Bayesian machine learning. She has been awarded an Early Career Grant (ECG) from the Office of Naval Research (2020), an AISTATS Notable Paper Award (2019), an NSF CAREER Award (2018), a Sloan Research Fellowship (2018), an Army Research Office Young Investigator Program (YIP) award (2017), Google Faculty Research Awards, an Amazon Research Award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the Ph.D. student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley Fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average).