De-identification is the process of removing identifying information from a dataset. The term de-identification is sometimes used synonymously with the terms anonymization and pseudonymization.
After reading this chapter, you will be able to:
Define the following concepts:
Identifying information / personally identifying information
Aggregation and aggregate statistics
Perform a linkage attack
Perform a differencing attack
Explain the limitations of de-identification techniques
Explain the limitations of aggregate statistics
Identifying information has no formal definition. It is usually understood to be information which would be used to identify us uniquely in the course of daily life - name, address, phone number, e-mail address, etc. As we will see later, it’s impossible to formalize the concept of identifying information, because all information is identifying. The term personally identifiable information (PII) is often used synonymously with identifying information.
How do we de-identify information? Easy - we just remove the columns that contain identifying information!
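# Drop the directly identifying columns to produce the "de-identified" data
adult_data = adult.copy().drop(columns=['Name', 'SSN'])
# Keep the identifying columns separately, for use as auxiliary data later
adult_pii = adult[['Name', 'SSN', 'DOB', 'Zip']]
adult_data.head(1)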
| DOB | Zip | Age | Workclass | fnlwgt | Education | Education-Num | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Target |
We’ll save some of the identifying information for later, when we’ll use it as auxiliary data to perform a re-identification attack.
Another way to prevent the release of private information is to release only aggregate data.
Problem of Small Groups
In many cases, aggregate statistics are broken down into smaller groups. For example, we might want to know the average age of people with a particular education level.
Aggregation is supposed to improve privacy because it’s hard to identify the contribution of a particular individual to the aggregate statistic. But what if we aggregate over a group with just one person in it? In that case, the aggregate statistic reveals one person’s age exactly, and provides no privacy protection at all! In our dataset, most individuals have a unique ZIP code - so if we compute the average age by ZIP code, then most of the “averages” actually reveal an individual’s exact age.
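The small-group problem can be sketched on a toy table. The `toy` DataFrame and its values below are hypothetical stand-ins for the chapter's `adult` dataset; the point is only that a group of size one turns an "average" into an exact value.

```python
import pandas as pd

# Hypothetical toy data standing in for the chapter's `adult` DataFrame.
toy = pd.DataFrame({
    'Zip': ['10001', '10001', '20002', '30003'],
    'Age': [34, 52, 41, 67],
})

# Average age and group size for each ZIP code.
by_zip = toy.groupby('Zip')['Age'].agg(['mean', 'count'])

# Groups of size one: the "average" is exactly one person's age.
exposed = by_zip[by_zip['count'] == 1]
print(exposed)
```

Checking the `count` column alongside each aggregate is one simple way to see which released statistics describe a single individual.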
The situation above, where small groups prevent aggregation from hiding information about individuals, turns out to be quite common. The US Census Bureau, for example, releases aggregate statistics at the block level. Some census blocks have large populations, but some have a population of zero!
How big a group is “big enough” for aggregate statistics to help? It’s hard to say - it depends on the data and on the attack - so it’s challenging to build confidence that aggregate statistics are really privacy-preserving. However, even very large groups do not make aggregation completely robust against attacks, as we will see next.
The problems with aggregation get even worse when you release multiple aggregate statistics over the same data. For example, consider the following two summation queries over large groups in our dataset (the first over the whole dataset, and the second over all records except one):
adult['Age'].sum()
adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()
If we know both answers, we can simply take the difference and determine Karrie’s age completely! This kind of attack can proceed even if the aggregate statistics are over very large groups.
adult['Age'].sum() - adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()
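The same idea appears to carry over to other aggregates. As a minimal sketch, assuming the attacker also knows the group size `n` (for example from a count query), two averages are enough to recover one value, since `n * mean_all - (n - 1) * mean_rest` equals the excluded person's value. The names and ages below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the chapter's `adult` DataFrame is much larger.
toy = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Age': [34, 52, 41, 67],
})
target = 'Carol'

# Two "harmless-looking" aggregate queries, plus the group size.
n = len(toy)
mean_all = toy['Age'].mean()
mean_rest = toy[toy['Name'] != target]['Age'].mean()

# Undo the averaging: total of everyone minus total of everyone else.
recovered_age = n * mean_all - (n - 1) * mean_rest
print(recovered_age)
```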
This is a recurring theme:
Releasing data that is useful makes ensuring privacy very difficult
Distinguishing between malicious and non-malicious queries is not possible
A linkage attack involves combining auxiliary data with de-identified data to re-identify individuals.
In the simplest case, a linkage attack can be performed via a join of two tables containing these datasets.
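A minimal sketch of such a join, using two hypothetical toy tables rather than the chapter's `adult_data` and `adult_pii`: the de-identified table keeps `DOB` and `Zip`, and the attacker's auxiliary table pairs those same columns with names.

```python
import pandas as pd

# Hypothetical de-identified release: names removed, DOB and Zip kept.
deidentified = pd.DataFrame({
    'DOB': ['1985-03-12', '1990-07-01', '1972-11-30'],
    'Zip': ['10001', '20002', '30003'],
    'Occupation': ['Sales', 'Tech-support', 'Exec-managerial'],
})

# Hypothetical auxiliary data the attacker already holds.
auxiliary = pd.DataFrame({
    'Name': ['Alice Example', 'Bob Example'],
    'DOB': ['1985-03-12', '1972-11-30'],
    'Zip': ['10001', '30003'],
})

# Join on the overlapping columns to attach names to "anonymous" rows.
linked = pd.merge(auxiliary, deidentified, on=['DOB', 'Zip'], how='inner')
print(linked[['Name', 'Occupation']])
```

Each row of `linked` ties a name from the auxiliary data to an attribute that the de-identified release was supposed to keep anonymous.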
Simple linkage attacks are surprisingly effective:
Just a single data point is sufficient to narrow things down to a few records
The narrowed-down set of records helps suggest additional auxiliary data which might be helpful
Two data points are often good enough to re-identify a huge fraction of the population in a particular dataset
Three data points (gender, ZIP code, date of birth) are estimated to uniquely identify 87% of people in the US
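One way to check claims like these on a given dataset is to measure what fraction of records is unique under a chosen set of columns. The helper `fraction_unique` and the toy table below are hypothetical illustrations, not part of the chapter's code.

```python
import pandas as pd

def fraction_unique(df, columns):
    """Fraction of rows whose combination of values in `columns` occurs exactly once."""
    # duplicated(keep=False) marks every row that shares its values with another row.
    shared = df.duplicated(subset=columns, keep=False)
    return (~shared).mean()

# Hypothetical toy data with three quasi-identifiers.
toy = pd.DataFrame({
    'Sex': ['F', 'F', 'M', 'M', 'F'],
    'Zip': ['10001', '10001', '10001', '20002', '20002'],
    'DOB': ['1985-03-12', '1988-01-20', '1990-07-01', '1990-07-01', '1985-03-12'],
})

# Adding columns can only keep or increase the share of unique records.
for cols in (['Sex'], ['Sex', 'Zip'], ['Sex', 'Zip', 'DOB']):
    print(cols, fraction_unique(toy, cols))
```

On real data, the same function could be applied to the columns an attacker is assumed to know, to estimate how many individuals a linkage attack could single out.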