Abstract: The privacy risks of analyzing and sharing sensitive data about individuals have never been more apparent: increasingly sophisticated attacks demonstrate that private information can leak even from aggregate statistics or trained models.
Differential privacy addresses these risks with a rigorous, mathematically proven model of privacy protection, which has broad applications, from analytics to machine learning to synthetic data. By adding carefully calibrated noise to statistical calculations, it is possible to ensure strong privacy and still enjoy statistically accurate results.
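As a rough illustration of "carefully calibrated noise", the sketch below adds Laplace noise to a simple count. It is a minimal, standard-library-only example and not the Tumult Analytics API; the names `laplace_noise` and `noisy_count`, the sample records, and the choice of epsilon are all assumptions made for illustration. It also assumes each individual contributes at most one record, so the count changes by at most 1 when one person is added or removed.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variates with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return the number of matching records plus Laplace noise of scale 1/epsilon.

    Assumption: one record per individual, so the count has sensitivity 1.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()  # illustrative only; not a cryptographically secure source
    people = [{"age": age} for age in range(100)]  # hypothetical data
    print(noisy_count(people, lambda p: p["age"] >= 50, epsilon=1.0, rng=rng))
```

Individual answers are perturbed, but the noise has mean zero and a scale that does not grow with the size of the dataset, which is why aggregate results over large populations can remain statistically accurate.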
In this workshop, we introduce differential privacy and show how to perform SQL-like analytics using the Tumult Labs differential privacy platform. Tumult Analytics (https://www.tmlt.io/platform) is used by a variety of organizations -- including the US Census Bureau, US Internal Revenue Service, and Wikimedia -- to publicly share aggregate statistics about populations of interest. Tumult Analytics will be made open-source this year and is available now for trial users.
Module 1: Introduction to differential privacy (30 minutes)
In the first module, we motivate the need for strong privacy protection by looking at sophisticated attacks (reconstruction and membership inference attacks) that render conventional privacy protection methods ineffective. We then present differential privacy, describing what it means, where it can be used and how it protects against attacks that are both known and unknown. We provide a brief overview of existing open-source differential privacy tools.
Module 2: Analyzing data under differential privacy (30 minutes)
Next, we use examples to illustrate the key ideas behind accurately analyzing data under differential privacy. Key concepts we will cover include:
The concept of a privacy loss budget and privacy loss accounting
The interplay between noise and privacy loss
How queries must be modified to ensure differential privacy (and therefore reduce the risk of catastrophic privacy loss)
The focus will be on supporting workflows that involve statistical queries (counts, averages, medians) on structured tabular data, possibly transformed through operations such as group by, filter, join, map, etc. Examples will use the Tumult Analytics programming platform, but the ideas are applicable to any differentially private computation.
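The concepts listed above can be sketched in a few lines of plain Python. This is a hedged illustration under stated assumptions, not the Tumult Analytics API: `PrivacyBudget`, `noisy_group_counts`, and the sample rows are hypothetical names and data. The sketch assumes pure epsilon-differential privacy with simple sequential composition (the epsilons of successive queries add up) and one row per individual. It also shows one way a group-by query is modified: the groups come from a fixed public list rather than from the data, so that the mere presence of a group does not reveal anything.

```python
import random
from collections import Counter


class PrivacyBudget:
    """Tracks cumulative privacy loss under sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total - self.spent

    def charge(self, epsilon: float) -> None:
        """Record the cost of one query, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_group_counts(rows, key, public_groups, epsilon, budget, rng):
    """Group-by count with Laplace noise of scale 1/epsilon per group.

    Assumption: each individual contributes exactly one row, so adding or
    removing a person changes a single group's count by 1.
    Groups are taken from `public_groups`, not from the data, and empty
    groups still receive a noisy count.
    """
    budget.charge(epsilon)
    true_counts = Counter(row[key] for row in rows)
    return {
        group: true_counts.get(group, 0) + laplace_noise(1.0 / epsilon, rng)
        for group in public_groups
    }


if __name__ == "__main__":
    rng = random.Random()  # illustrative only; not a cryptographically secure source
    rows = [{"state": "NC"}, {"state": "NC"}, {"state": "NY"}]  # hypothetical data
    budget = PrivacyBudget(total_epsilon=1.0)
    print(noisy_group_counts(rows, "state", ["NC", "NY", "CA"], 0.5, budget, rng))
    print("remaining budget:", budget.remaining)
```

Spending a smaller epsilon per query leaves more budget for later queries but makes each answer noisier, which is the interplay between noise and privacy loss described above.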
Module 3: Differentially private data release using Tumult Analytics
In the final module, we take a deeper dive into an example application in which a data custodian wishes to release aggregate statistics with high accuracy and strong privacy guarantees. We will use the Tumult Analytics programming platform to build an end-to-end solution.
Bio: Ashwin Machanavajjhala is an Assistant Professor in the Department of Computer Science, Duke University, and an Associate Director at the Information Initiative@Duke (iiD). Previously, he was a Senior Research Scientist in the Knowledge Management group at Yahoo! Research. His primary research interests lie in algorithms for ensuring privacy in statistical databases and augmented reality applications. He is a recipient of the National Science Foundation Faculty Early CAREER award in 2013 and the 2008 ACM SIGMOD Jim Gray Dissertation Award Honorable Mention. Ashwin graduated with a Ph.D. from the Department of Computer Science, Cornell University, and a B.Tech in Computer Science and Engineering from the Indian Institute of Technology, Madras.