Abstract: The privacy risks of analyzing and sharing sensitive data about individuals have never been more apparent, as increasingly sophisticated attacks demonstrate that private information can leak even from aggregate statistics or trained models.
Differential privacy addresses these risks with a rigorous, mathematically proven model of privacy protection, which has broad applications, from analytics to machine learning to synthetic data. By adding carefully calibrated noise to statistical calculations, it is possible to ensure strong privacy and still enjoy statistically accurate results.
In this workshop, we introduce differential privacy and show how to perform SQL-like analytics using the Tumult Labs differential privacy platform. Tumult Analytics (https://www.tmlt.io/platform) is used by a variety of organizations -- including the US Census Bureau, the US Internal Revenue Service, and Wikimedia -- to publicly share aggregate statistics about populations of interest. Tumult Analytics will be made open-source this year and is available now for trial users.
Module 1: Introduction to differential privacy (30 minutes)
In the first module, we motivate the need for strong privacy protection by looking at sophisticated attacks (reconstruction and membership inference attacks) that render conventional privacy protection methods ineffective. We then present differential privacy, describing what it means, where it can be used and how it protects against attacks that are both known and unknown. We provide a brief overview of existing open-source differential privacy tools.
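For readers who want the formal statement ahead of the session, the standard definition of pure differential privacy can be written as follows. The symbols here (a randomized mechanism M, neighboring datasets D and D' that differ in one individual's records, and a privacy loss parameter epsilon) are the usual textbook notation, not notation taken from the workshop materials.

```latex
% A randomized mechanism M satisfies \varepsilon-differential privacy if,
% for every pair of neighboring datasets D and D' and every set S of outputs,
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,].
\]
```

A smaller epsilon forces the two output distributions closer together, which limits what any attacker, known or unknown, can infer about a single individual from the output.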
Module 2: Analyzing data under differential privacy (30 minutes)
Next, we use examples to illustrate the key ideas behind accurately analyzing data under differential privacy. Key concepts we will cover include:
The concept of a privacy loss budget and privacy loss accounting
The interplay between noise and privacy loss
How queries must be modified to ensure differential privacy (and therefore reduce the risk of catastrophic privacy loss)
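The first two concepts above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Tumult Analytics API: the `PrivacyBudget` class and the `noisy_count` and `laplace_noise` helpers are hypothetical names introduced here, the accounting assumes basic sequential composition of pure epsilon-differential privacy, and the standard-library random generator and floating-point sampling are not suitable for a production release.

```python
import random


class PrivacyBudget:
    """Hypothetical accountant: tracks epsilon spent under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse any query that would push total privacy loss past the budget.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self):
        return self.total - self.spent


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with mean 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    # Charge the budget before touching the data.
    budget.charge(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so Laplace noise with scale 1/epsilon suffices.
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

With a total budget of 1.0, two counts at epsilon 0.5 each use up the budget and a third query is refused. Halving epsilon for a query doubles the noise scale, which is the trade-off between noise and privacy loss.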
The focus will be on supporting workflows that involve statistical queries (counts, averages, medians) on structured tabular data, possibly transformed through operations such as group by, filter, join, map, etc. Examples will use the Tumult Analytics programming platform, but the ideas are applicable to any differentially private computation.
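As a rough illustration of such a workflow, the sketch below computes a per-group average in plain Python. It is not Tumult Analytics code: `dp_grouped_mean` and `laplace_noise` are hypothetical helpers, and the sketch assumes that the list of group keys is public, that each individual contributes one row, and that the caller supplies clamping bounds. Values are clamped so that one row can change a group's sum by a bounded amount, and the per-group budget is split evenly between a noisy sum and a noisy count.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_grouped_mean(rows, public_groups, lower, upper, epsilon, rng):
    """rows: iterable of (group, value) pairs; public_groups: known, public keys."""
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    eps_each = epsilon / 2.0  # half for the sum, half for the count
    # After clamping, one row changes a group's sum by at most this much.
    sum_sensitivity = max(abs(lower), abs(upper))

    sums = {g: 0.0 for g in public_groups}
    counts = {g: 0 for g in public_groups}
    for group, value in rows:
        if group not in sums:
            continue  # ignore keys outside the public list
        sums[group] += min(max(value, lower), upper)  # clamp to [lower, upper]
        counts[group] += 1

    result = {}
    for g in public_groups:
        # Each row falls in exactly one group, so every group can use the
        # full epsilon (parallel composition) without raising the total cost.
        noisy_sum = sums[g] + laplace_noise(sum_sensitivity / eps_each, rng)
        noisy_count = counts[g] + laplace_noise(1.0 / eps_each, rng)
        # Guard against dividing by a tiny or negative noisy count.
        result[g] = noisy_sum / max(noisy_count, 1.0)
    return result
```

Iterating over the public key list, rather than over the keys observed in the data, means that the presence or absence of a group in the output reveals nothing about any individual. Tighter clamping bounds reduce the noise but bias the average when true values fall outside them.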
Module 3: Differentially private data release using Tumult Analytics
In the final module, we take a deeper dive into an example application where a data custodian wishes to release aggregate statistics with high accuracy and strong privacy guarantees. We will use the Tumult Analytics programming platform to build an end-to-end solution.
Bio: Michael Hay is an Associate Professor of Computer Science at Colgate University and founder/CTO of Tumult Labs, a startup that helps organizations safely release data using differential privacy. His research interests include data privacy, databases, data mining, machine learning, and social network analysis. He was previously a Research Data Scientist at the US Census Bureau and a Computing Innovation Fellow at Cornell University. He holds a Ph.D. from the University of Massachusetts Amherst and a bachelor's degree from Dartmouth College. His research is supported by grants from DARPA and NSF.