Differential Privacy
Learning Objectives
After reading this chapter, you will be able to:
Define differential privacy
Explain the importance of the privacy parameter $\epsilon$
Use the Laplace mechanism to enforce differential privacy for counting queries
Like $k$-Anonymity, differential privacy is a formal notion of privacy (i.e. it's possible to prove that a data release has the property). Unlike $k$-Anonymity, however, differential privacy is a property of algorithms, and not a property of data: we can prove that an algorithm satisfies differential privacy, and to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
Definition 3 (Privacy Mechanism)
A function which satisfies differential privacy is often called a mechanism. We say that a mechanism $F$ satisfies differential privacy if for all neighboring datasets $x$ and $x'$, and all possible outputs $S$,

$$\frac{\Pr[F(x) = S]}{\Pr[F(x') = S]} \leq e^\epsilon$$
Two datasets are considered neighbors if they differ in the data of a single individual. Note that $F$ is typically a randomized function, so the probability distribution describing its outputs is not just a point distribution.
The important implication of this definition is that $F$'s output will be essentially the same, with or without the data of any specific individual. The randomness built into $F$ should be enough that an observed output does not reveal which of $x$ or $x'$ was the input: if an adversary can't determine which one was the input, then the adversary can't tell whether any particular individual's data was present at all, much less what that data contained.
The $\epsilon$ parameter in the definition is called the privacy parameter or the privacy budget. $\epsilon$ provides a knob to tune the "amount of privacy" the definition provides: small values of $\epsilon$ require $F$ to produce very similar outputs on neighboring inputs, and therefore provide higher levels of privacy, while large values of $\epsilon$ allow more dissimilarity in the outputs, and therefore provide less privacy.
How should we set $\epsilon$ to prevent bad outcomes in practice? Nobody knows for certain. The general consensus is that $\epsilon$ should be around 1 or smaller, and that values of $\epsilon$ above 10 probably don't do much to protect privacy.
Note
Why is $e^\epsilon$ used in the definition, rather than a plain bound on the difference in probabilities? An equivalent way to state the requirement is that for all neighboring datasets $x$ and $x'$, and all possible outputs $S$,

$$\Pr[F(x) = S] \leq e^\epsilon \cdot \Pr[F(x') = S]$$

If $\epsilon$ is small, then $e^\epsilon \approx 1 + \epsilon$ (for example, $e^{0.1} \approx 1.105$), so the requirement says that changing one person's data can change the probability of any output by a multiplicative factor of at most roughly $1 + \epsilon$. This definition might be more intuitive, especially if you have not studied probability theory.
The Laplace Mechanism
Differential privacy is typically used to answer specific queries. Let’s consider a query on the census data, without differential privacy.
“How many individuals in the dataset are 40 years old or older?”
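As a concrete setup, suppose the census data is loaded into a pandas DataFrame with one row per person; the file name and the `Age` column below are illustrative assumptions, not part of the original text. The non-private query is then a simple filter and count:

```python
import pandas as pd

# Hypothetical dataset: one row per person, with an 'Age' column
adult = pd.read_csv('adult_with_pii.csv')

# Non-private counting query: how many people are 40 or older?
print(adult[adult['Age'] >= 40].shape[0])
```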
The easiest way to achieve differential privacy for this query is to add random noise to its answer. The key challenge is to add enough noise to satisfy the definition of differential privacy, but not so much that the answer becomes too noisy to be useful. To make this process easier, some basic mechanisms have been developed in the field of differential privacy, which describe exactly what kind of - and how much - noise to use. One of these is called the Laplace mechanism [4].
Definition 4 (Laplace Mechanism)
According to the Laplace mechanism, for a function $f(x)$ which returns a number, the following definition of $F(x)$ satisfies $\epsilon$-differential privacy:

$$F(x) = f(x) + \mathsf{Lap}\left(\frac{s}{\epsilon}\right)$$

where $s$ is the sensitivity of $f$, and $\mathsf{Lap}(S)$ denotes sampling from the Laplace distribution with center 0 and scale $S$.
The sensitivity of a function $f$ is the amount $f$'s output changes when its input changes by one row. Sensitivity is a deep topic in its own right; for now, the important fact is that counting queries always have a sensitivity of 1: if a query counts the number of rows with a particular property, then modifying a single row can change the count by at most 1.
Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an $\epsilon$ of our choosing; for now, let's pick $\epsilon = 0.1$. We can sample the required noise with NumPy's np.random.laplace.
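A minimal sketch of the noisy query, reusing the hypothetical `adult` DataFrame from above; `np.random.laplace` takes the distribution's center (`loc`) and scale:

```python
import numpy as np

sensitivity = 1   # counting queries have sensitivity 1
epsilon = 0.1     # the privacy parameter

# True count plus Laplace noise with scale = sensitivity / epsilon
true_count = adult[adult['Age'] >= 40].shape[0]
dp_count = true_count + np.random.laplace(loc=0, scale=sensitivity / epsilon)
print(dp_count)
```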
You can see the effect of the noise by running this code multiple times. Each time, the output changes, but most of the time, the answer is close enough to the true answer (14,235) to be useful.
How Much Noise is Enough?
How do we know that the Laplace mechanism adds enough noise to prevent the re-identification of individuals in the dataset? For one thing, we can try to break it! Let’s write down a malicious counting query, which is specifically designed to determine whether Karrie Trusslove has an income greater than $50k.
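A sketch of such a query, assuming (hypothetically) that the DataFrame has `Name` and `Income` columns:

```python
# Malicious counting query targeting one individual:
# returns 1 if Karrie's income exceeds $50k, and 0 otherwise
karrie_row = adult[adult['Name'] == 'Karrie Trusslove']
print(karrie_row[karrie_row['Income'] > 50000].shape[0])
```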
This result definitely violates Karrie’s privacy, since it reveals the value of the income column for Karrie’s row. Since we know how to ensure differential privacy for counting queries with the Laplace mechanism, we can do so for this query:
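The noisy version of the same query: it is still a counting query, so its sensitivity is 1 and the same Laplace mechanism applies.

```python
import numpy as np

sensitivity = 1
epsilon = 0.1

karrie_row = adult[adult['Name'] == 'Karrie Trusslove']
malicious_count = karrie_row[karrie_row['Income'] > 50000].shape[0]

# The noise (scale 1/0.1 = 10) dwarfs the 0-or-1 true answer
print(malicious_count + np.random.laplace(loc=0, scale=sensitivity / epsilon))
```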
Is the true answer 0 or 1? There’s too much noise to be able to reliably tell. This is how differential privacy is intended to work - the approach does not reject queries which are determined to be malicious; instead, it adds enough noise that the results of a malicious query will be useless to the adversary.
The Unit of Privacy
The typical definition of differential privacy defines neighboring datasets as any two datasets that differ in "one person's data." In practice, however, it's often difficult or impossible to figure out how much data "belongs" to which person.
The unit of privacy refers to the formal definition of “neighboring” used in a differential privacy guarantee. The most common unit of privacy is “one person” - meaning the privacy guarantee protects the whole person, forever. But other definitions are possible; Apple’s implementation of differential privacy, for example, uses a “person-day” unit of privacy, meaning that the guarantee applies to the data submitted by one person on a single day.
The unit of privacy can result in surprising privacy failures. For example, in Apple’s system, the differential privacy guarantee does not protect trends in the data occurring across multiple days - even for individual people. If a person submits identical data for 365 days in a row, then differential privacy provides essentially no protection for that data.
The “one person” unit of privacy is a good default, and usually avoids surprises. Other units of privacy are usually used to make it easier to get accurate results, or because it’s hard to tie specific data values to individual people.
It’s common to make a simplifying assumption that makes it easy to formalize the definition of neighboring datasets:
Each individual's data is contained in exactly one row of the dataset
If this assumption is true, then it’s possible to define neighboring datasets formally, in terms of the format of the data (see below), and retain the desired “one person” unit of privacy. When it’s not true, the best solution is to transform the data and queries in order to achieve the “one person” unit of privacy. It’s best to avoid using a different unit of privacy whenever possible.
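For example, if each person can contribute many rows, a pre-processing step can collapse the data down to one row per person before any query runs. A minimal sketch, using a hypothetical `events` DataFrame keyed by a `PersonId` column:

```python
import pandas as pd

# Hypothetical data: several rows per person, keyed by 'PersonId'
events = pd.DataFrame({
    'PersonId': [1, 1, 1, 2, 3, 3],
    'Value':    [10, 12, 11, 7, 5, 6],
})

# Keep at most one row per person, restoring the
# "one person = one row" assumption for later queries
one_row_per_person = events.groupby('PersonId').head(1)
print(one_row_per_person)
```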
Bounded and Unbounded Differential Privacy
Under the “one person = one row” simplification, neighboring datasets differ in one row. What does “differ” mean? There are two ways to define that, too! Here are the two formal definitions:
Definition 5 (Unbounded Differential Privacy)
Under unbounded differential privacy, two datasets $x$ and $x'$ are considered neighbors if $x'$ can be obtained from $x$ by adding or removing one row.
Definition 6 (Bounded Differential Privacy)
Under bounded differential privacy, two datasets $x$ and $x'$ are considered neighbors if $x'$ can be obtained from $x$ by changing the value of exactly one row. Note that neighboring datasets differ in size under unbounded differential privacy, but have the same size under the bounded definition.
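To make the distinction concrete, here is a small sketch (the DataFrame and values are illustrative) of how a neighbor of a dataset `x` is constructed under each definition:

```python
import pandas as pd

x = pd.DataFrame({'Age': [34, 52, 41]})

# Unbounded DP: a neighbor is obtained by adding or removing one row,
# so x and its neighbor have different sizes
x_unbounded_neighbor = x.drop(x.index[0])

# Bounded DP: a neighbor is obtained by changing one row,
# so x and its neighbor have the same size
x_bounded_neighbor = x.copy()
x_bounded_neighbor.loc[x.index[0], 'Age'] = 29
```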
Summary
Differential privacy is a property of algorithms, and not a property of data.
A function which satisfies differential privacy is often called a mechanism.
The easiest way to achieve differential privacy for a numeric function is often to add random noise to its answer.
The unit of privacy refers to the formal definition of “neighboring” used in a differential privacy guarantee. The most common unit of privacy is “one person” - meaning the privacy guarantee protects the whole person, forever.