Analyzing Sensitive Data Using Differential Privacy


The privacy risks of analyzing and sharing sensitive data about individuals have never been more apparent: increasingly sophisticated attacks demonstrate that private information can leak even from aggregate statistics or trained models.

Differential privacy addresses these risks with a rigorous, mathematically proven model of privacy protection, which has broad applications, from analytics to machine learning to synthetic data. By adding carefully calibrated noise to statistical calculations, it is possible to ensure strong privacy and still enjoy statistically accurate results.
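As a rough illustration of "carefully calibrated noise", the sketch below applies the Laplace mechanism to a simple count. It is a minimal teaching example, not the Tumult Analytics implementation: the function names `laplace_noise` and `noisy_count` are hypothetical, and it assumes each individual contributes at most one record, so that adding or removing one person changes the count by at most 1.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a count with Laplace noise of scale 1/epsilon.

    Assumption: each individual contributes at most one record, so the
    count has sensitivity 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 58, 62, 19, 77]
    # Smaller epsilon means stronger privacy and a noisier answer.
    print(noisy_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng))
```

Floating-point sampling like this is known to be unsuitable for production use; vetted libraries use hardened noise generation.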

In this workshop, we introduce differential privacy and show how to run SQL-like analytics using Tumult Analytics, the Tumult Labs differential privacy platform. Tumult Analytics is used by a variety of organizations -- including the US Census Bureau, US Internal Revenue Service, and Wikimedia -- to publicly share aggregate statistics about populations of interest. Tumult Analytics will be made open-source this year and is available now for trial users.

Session Outline:
Module 1: Introduction to differential privacy (30 minutes)

In the first module, we motivate the need for strong privacy protection by looking at sophisticated attacks (reconstruction and membership inference attacks) that render conventional privacy protection methods ineffective. We then present differential privacy, describing what it means, where it can be used and how it protects against attacks that are both known and unknown. We provide a brief overview of existing open-source differential privacy tools.

Module 2: Analyzing data under differential privacy (30 minutes)

Next, we use examples to illustrate the key ideas behind accurately analyzing data under differential privacy. Key concepts we will cover include:

The concept of a privacy loss budget and privacy loss accounting

The interplay between noise and privacy loss

How queries must be modified to ensure differential privacy (and therefore reduce the risk of catastrophic privacy loss)
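The first two concepts above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not how any particular platform does its accounting: the `PrivacyBudget` class and `laplace_scale` helper are hypothetical names, and the accounting uses basic sequential composition, in which the epsilons of successive queries simply add up.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return max(self.total_epsilon - self.spent, 0.0)

    def spend(self, epsilon: float) -> None:
        """Charge one query against the budget, refusing it if it would overspend."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale of the Laplace mechanism: halving epsilon doubles the noise."""
    return sensitivity / epsilon


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    for eps in (0.5, 0.25, 0.25):
        budget.spend(eps)
        print(f"epsilon={eps}: noise scale {laplace_scale(1.0, eps)}, "
              f"remaining budget {budget.remaining}")
```

Once the budget is exhausted, further queries are refused rather than answered, which is what keeps the total privacy loss bounded.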

The focus will be on supporting workflows that involve statistical queries (counts, averages, medians) on structured tabular data, possibly transformed through operations such as group by, filter, join, map, etc. Examples will use the Tumult Analytics programming platform, but the ideas are applicable to any differentially private computation.
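To make "queries must be modified" concrete, here is a hedged sketch of two common modifications, written in plain Python rather than with the Tumult Analytics API. All names (`noisy_groupby_count`, `noisy_clamped_sum`) are hypothetical. It assumes each individual contributes at most one row. The group-by takes its group keys from a public list instead of from the data, and the sum clamps each value to fixed bounds so that one person's influence is limited.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, as a difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_groupby_count(rows, key, public_keys, epsilon, rng):
    """Noisy count per group.

    Every key in public_keys is reported, even with zero rows, and rows whose
    key is not in the list are ignored, so the set of output groups does not
    depend on the data. One row affects only one group's count by 1.
    """
    counts = Counter(row[key] for row in rows)
    return {k: counts.get(k, 0) + laplace_noise(1.0 / epsilon, rng)
            for k in public_keys}


def noisy_clamped_sum(values, lower, upper, epsilon, rng):
    """Noisy sum after clamping each value into [lower, upper].

    With clamping, adding or removing one row changes the sum by at most
    max(|lower|, |upper|), which sets the noise scale.
    """
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    rows = [{"state": "NY"}, {"state": "CA"}, {"state": "CA"}]
    print(noisy_groupby_count(rows, "state", ["CA", "NY", "TX"], 1.0, rng))
    print(noisy_clamped_sum([12, 250, 40], lower=0, upper=100, epsilon=1.0, rng=rng))
```

The clamping bounds trade bias for noise: tighter bounds mean less noise but distort more of the true values.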

Module 3: Differentially private data release using Tumult Analytics

In the final module, we take a deeper dive into an example application in which a data custodian wishes to release aggregate statistics with high accuracy and strong privacy guarantees. We will use the Tumult Analytics programming platform to build an end-to-end solution.


Michael Hay is an Associate Professor of Computer Science at Colgate University and founder/CTO of Tumult Labs, a startup that helps organizations safely release data using differential privacy. His research interests include data privacy, databases, data mining, machine learning, and social network analysis. He was previously a Research Data Scientist at the US Census Bureau and a Computing Innovation Fellow at Cornell University. He holds a Ph.D. from the University of Massachusetts Amherst and a bachelor's degree from Dartmouth College. His research is supported by grants from DARPA and NSF.

Open Data Science




