{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# De-identification\n", "\n", "```{admonition} Learning Objectives\n", "After reading this chapter, you will be able to:\n", "- Define the following concepts:\n", " - De-identification\n", " - Re-identification\n", " - Identifying information / personally identifiable information\n", " - Linkage attacks\n", " - Aggregation and aggregate statistics\n", " - Differencing attacks\n", "- Perform a linkage attack\n", "- Perform a differencing attack\n", "- Explain the limitations of de-identification techniques\n", "- Explain the limitations of aggregate statistics\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preliminary\n", "\n", "Download the dataset by clicking [here](https://github.com/uvm-plaid/programming-dp/raw/master/notebooks/adult_with_pii.csv) and place it in the same directory as this notebook.\n", "\n", "The dataset is based on census data. The personally identifiable information (PII) is made up." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name DOB SSN Zip Workclass \\\n", "0 Karrie Trusslove 9/7/1967 732-14-6110 64152 State-gov \n", "1 Brandise Tripony 6/7/1988 150-19-2766 61523 Self-emp-not-inc \n", "2 Brenn McNeely 8/6/1991 725-59-9860 95668 Private \n", "3 Dorry Poter 4/6/2009 659-57-4974 25503 Private \n", "4 Dick Honnan 9/16/1951 220-93-3811 75387 Private \n", "\n", " Education Marital Status Occupation Relationship Race \\\n", "0 Bachelors Never-married Adm-clerical Not-in-family White \n", "1 Bachelors Married-civ-spouse Exec-managerial Husband White \n", "2 HS-grad Divorced Handlers-cleaners Not-in-family White \n", "3 11th Married-civ-spouse Handlers-cleaners Husband Black \n", "4 Bachelors Married-civ-spouse Prof-specialty Wife Black \n", "\n", " Sex Hours per week Country Target Age \n", "0 Male 40 United-States <=50K 56 \n", "1 Male 13 United-States <=50K 35 \n", "2 Male 40 United-States <=50K 32 \n", "3 Male 40 United-States <=50K 14 \n", "4 Female 40 Cuba <=50K 72 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult = pd.read_csv(\"adult_with_pii.csv\")\n", "adult.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## De-identification\n", "\n", "*De-identification* is the process of removing *identifying information* from a dataset. The term *de-identification* is sometimes used synonymously with the terms *anonymization* and *pseudonymization*.\n", "\n", "Identifying information has no formal definition. It is usually understood to be information which would be used to identify us uniquely in the course of daily life - name, address, phone number, e-mail address, etc. As we will see later, it's *impossible* to formalize the concept of identifying information, because *all* information is identifying. The term *personally identifiable information (PII)* is often used synonymously with identifying information.\n", "\n", "How do we de-identify information? 
Easy - we just remove the columns that contain identifying information!" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " DOB Zip Workclass Education Marital Status Occupation \\\n", "0 9/7/1967 64152 State-gov Bachelors Never-married Adm-clerical \n", "\n", " Relationship Race Sex Hours per week Country Target Age \n", "0 Not-in-family White Male 40 United-States <=50K 56 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult_data = adult.copy().drop(columns=['Name', 'SSN'])\n", "adult_pii = adult[['Name', 'SSN', 'DOB', 'Zip']]\n", "adult_data.head(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll save some of the identifying information for later, when we'll use it as *auxiliary data* to perform a *re-identification* attack." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linkage Attacks\n", "\n", "Imagine we want to determine the income of a friend from our de-identified data. Names have been removed, but we happen to know some auxiliary information about our friend. Our friend's name is Karrie Trusslove, and we know Karrie's date of birth and zip code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform a simple *linkage attack*, we look at the overlapping columns between the dataset we're trying to attack, and the auxiliary data we know. In this case, both datasets have dates of birth and zip codes. We look for rows in the dataset we're attacking with dates of birth and zip codes that match Karrie's date of birth and zip code. In databases, this is called a *join* of two tables, and we can do it in Pandas using `merge`. If there is only one such row, we've found Karrie's row in the dataset we're attacking." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name SSN DOB Zip Workclass Education \\\n", "0 Karrie Trusslove 732-14-6110 9/7/1967 64152 State-gov Bachelors \n", "\n", " Marital Status Occupation Relationship Race Sex Hours per week \\\n", "0 Never-married Adm-clerical Not-in-family White Male 40 \n", "\n", " Country Target Age \n", "0 United-States <=50K 56 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "karries_row = adult_pii[adult_pii['Name'] == 'Karrie Trusslove']\n", "pd.merge(karries_row, adult_data, left_on=['DOB', 'Zip'], right_on=['DOB', 'Zip'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, there is only one row that matches. We have used auxiliary data to re-identify an individual in a de-identified dataset, and we're able to infer that Karrie's income is less than $50k." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How Hard is it to Re-Identify Karrie?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This scenario is made up, but linkage attacks are surprisingly easy to perform in practice. How easy? It turns out that in many cases, just one data point is sufficient to pinpoint a row!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name SSN DOB_x Zip DOB_y Workclass \\\n", "0 Karrie Trusslove 732-14-6110 9/7/1967 64152 9/7/1967 State-gov \n", "\n", " Education Marital Status Occupation Relationship Race Sex \\\n", "0 Bachelors Never-married Adm-clerical Not-in-family White Male \n", "\n", " Hours per week Country Target Age \n", "0 40 United-States <=50K 56 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(karries_row, adult_data, left_on=['Zip'], right_on=['Zip'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So ZIP code is sufficient **by itself** to allow us to re-identify Karrie. What about date of birth?" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Name SSN DOB Zip_x Zip_y Workclass \\\n", "0 Karrie Trusslove 732-14-6110 9/7/1967 64152 64152 State-gov \n", "1 Karrie Trusslove 732-14-6110 9/7/1967 64152 67306 Private \n", "2 Karrie Trusslove 732-14-6110 9/7/1967 64152 62254 Self-emp-not-inc \n", "\n", " Education Marital Status Occupation Relationship Race \\\n", "0 Bachelors Never-married Adm-clerical Not-in-family White \n", "1 11th Widowed Farming-fishing Unmarried White \n", "2 Masters Married-civ-spouse Exec-managerial Husband White \n", "\n", " Sex Hours per week Country Target Age \n", "0 Male 40 United-States <=50K 56 \n", "1 Female 40 United-States <=50K 56 \n", "2 Male 50 United-States >50K 56 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.merge(karries_row, adult_data, left_on=['DOB'], right_on=['DOB'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time, there are three rows returned - and we don't know which one is the real Karrie. But we've still learned a lot!\n", "\n", "- We know that there's a 2/3 chance that Karrie's income is less than $50k\n", "- We can look at the differences between the rows to determine what additional auxiliary information would *help* us to distinguish them (e.g. sex, occupation, marital status)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Is Karrie Special?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How hard is it to re-identify others in the dataset? Is Karrie especially easy or especially difficult to re-identify? A good way to gauge the effectiveness of this type of attack is to look at how \"selective\" certain pieces of data are - how good they are at narrowing down the set of potential rows which may belong to the target individual. 
For example, is it common for birthdates to occur more than once?\n", "\n", "We'd like to get an idea of how many dates of birth are likely to be useful in performing an attack, which we can do by looking at how common \"unique\" dates of birth are in the dataset. The histogram below shows that *the vast majority* of dates of birth occur 1, 2, or 3 times in the dataset, and *no date of birth* occurs more than 8 times. This means that date of birth is fairly *selective* - it's effective in narrowing down the possible records for an individual." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAEGCAYAAACO8lkDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAhSUlEQVR4nO3de5xVdb3/8dfbGyEKSuAopxSvecMLTGp5acqOp1+m9dPfjwdidsxwtHPwkugR7UjiqQ5Z1Em7nEN1lBTpwq+HdH6Yx1S2WpkXePgDSVBK1DQlUZwGE0Q/vz/WGtnOmpm9mJm19yx4Px+Pebju671ncH/2+n7X/i5FBGZmZtW2aXQAMzMbeFwczMwsw8XBzMwyXBzMzCzDxcHMzDJcHMzMLGO7RgfoDyNGjIjRo0f3ev9169YxZMiQ/gtUoDJlhXLlLVNWKFfeMmWFcuXtS9ZFixa9GBEju1wZEaX/GTduXPTFwoUL+7R/PZUpa0S58pYpa0S58pYpa0S58vYlK/BwdPO+6mYlMzPLKKRZSdK2wDTgBeBg4FpgF2AC0AasjIh5knYApgNPAXsB0yLidUkTgD2B4cBNEbGsiJxmZta1ovocxgC7R8QXJB0FnAZ8BDg9ItZJWiDpdpJisTwiZktqBcZLug2YGBGnShoKzAFOKSinmZl1oahmpeXAkZIOBo4EfgHsGhHr0vVPAocDLcDD6bLF6fw4YAVARLQBoyS5+cvMrI4KuXKIiNckXQLMBe4DbgfaqzZpA3YDRqbT3S0DWA8MA16uPkd6pdEK0NTURKVS6XXe9vb2Pu1fT2XKCuXKW6asUK68ZcoK5cpbVNai+hwOJGlGOgI4G5gK7Fy1yTDgeWB1Ov1Mp2XjqrYdBKztfI6ImAXMAmhubo6WlpZe561UKvRl/3oqU1YoV94yZYVy5S1TVihX3qKyFtVccxLweHqr1C3AocAaSR034+4NLAUWAs3psrHAPcAi4ACAtM/hufQ4ZmZWJ0V1SP8YuEzSucA+wNXAn4ErJa0FboiIdkk3A9MlTQJGs+lupbmSpgAjSK46zMysjorqc3gBuLSLVUs6bbcBuKKL/ecWkas7S599hbOnLqjnKQFYNePkup/TzCwP3wVkZmYZLg5mZpbh4mBmZhkuDmZmluHiYGZmGS4OZmaW4eJgZmYZLg5mZpbh4mBmZhkuDmZmluHiYGZmGS4OZmaW4eJgZmYZLg5mZpbh4mBmZhkuDmZmluHiYGZmGS4OZmaW4eJgZmYZhTxDWtJEYGLVoh2Bi4EJQBuwMiLmSdoBmA48BewFTIuI1yVNAPYEhg
M3RcSyInKamVnXCikOEXELcAuApN2B04FrgdMjYp2kBZJuJykWyyNitqRWYLyk24CJEXGqpKHAHOCUInKamVnX6tGsdB4wD9g1Italy54EDgdagIfTZYvT+XHACoCIaANGSXLzl5lZHRVy5dBB0s7A7iRFqL1qVRuwGzAyne5uGcB6YBjwcqdjtwKtAE1NTVQqlV7nbBoMU8Zs7PX+vdWbzO3t7X16rfVWprxlygrlylumrFCuvEVlLbQ4AJOAnwJrgJ2rlg8DngdWp9PPdFo2rmrbQcDazgeOiFnALIDm5uZoaWnpdcjr58xn5tKifxVZq85s2ex9KpUKfXmt9VamvGXKCuXKW6asUK68RWUtrLlG0vbAx4BKRGwA1kgakq7eG1gKLASa02VjgXuARcAB6TGGAs9FRBSV08zMsor8uDweuCMi3kznLweulLQWuCEi2iXdDEyXNAkYzaa7leZKmgKMAKYWmNHMzLpQWHGIiDmd5pcASzot2wBc0cW+c4vKZWZmtfkuIDMzy3BxMDOzDBcHMzPLcHEwM7MMFwczM8twcTAzswwXBzMzy3BxMDOzDBcHMzPLcHEwM7MMFwczM8twcTAzswwXBzMzy8hVHCTtI+mdkoZIapW0d9HBzMyscfJeOZxF8iS3q4GhwAVFBTIzs8bLWxx+T/IM5yMj4mvAsuIimZlZo+UtDk3AbcDXJe1P8tQ2MzPbQuV6ElxEzJT07Yh4TVIT8P08+0naFjga2BgRD/Yhp5mZ1VGu4pBeLUyTtByYCewBPFVjn3cB5wOzI+IJSe8m6at4Fng9Ir4jScA0YDWwD/AvEdEm6UPA8cC2wJ0RcW/vXp6ZmfVG3mali4DPA2si4jXgAz1tLGkb4CvAjIh4Il38ZeA7EfFN4BhJo4ETASLiu8AvgQskbQf8M3ANSQf41enxzMysTvK+6f4hIp4GXkrnR9XY/kRAJG/2/yHpAODQiFiVrl8CHAu0AA+nyxan86NJilBExJvAqyRXKmZmVie5mpWAUZJOA3aTdCq1i8NhwJKImCHpYGA2MKhqfRuwGzAyne5uWfXyZ6tPIKkVaAVoamqiUqnkfClZTYNhypiNvd6/t3qTub29vU+vtd7KlLdMWaFcecuUFcqVt6iseYvDVcDlQDPwO+DCGtu/BuyUTj9G8sn/par1w4A/AsPT6Y5lz5P0PwzrtO3znU8QEbOAWQDNzc3R0tKS86VkXT9nPjOX5v1V9J9VZ7Zs9j6VSoW+vNZ6K1PeMmWFcuUtU1YoV96isuZtVnoN+HZEfAz4Em//ZN+Ve0muHgB2AV4Elqb9DKTrfg0sJCk4AGOBe0g6uocrsQ0wGHghZ04zM+sHeT8uX0PSNzAfWAd8FvhWdxtHxFJJ90u6gKRJ6AKSK4XJkp4BfhsRT6fTx0s6FziA5G6ljZK+SNIpvS1wddr3YGZmdZK3OCyPiPkA6Zv3O2vtkH6TurPLO20TwPQu9r0buDtnNjMz62d5m5UOTL/QhqShwOHFRTIzs0bLe+VwK3CvpI3ACOAfC0tkZmYNl3f4jEXAsZJGAGvwcyDMzLZoue/flHQgSecywEkkHcZmZrYFyju20q9IbiddnS46sLBEZmbWcHmvHBZFxEUdM5L2KSiPmZkNAHn7Du6XtEvV/F4FZDEzswEi75XDZOCa9G4lkXzr2YPhmZltofIWh1kR8cOOGUkthaQxM7MBIe+trD9MH8DzGrAIeKjQVFuJ0VMXbPY+U8Zs5Oxe7NfZqhkn9/kYZrblytXnIGkGcBxwSkSsBz5VaCozM2uovB3SqyPiGuCRdH5EMXHMzGwgyFscRksaBET630MKzGRmZg2Wt0N6NnAfyQN8pgKXFpbIzMwaLm9xeDUijuoYWykdatvMzLZQeZuVrgOIiBddGMzMtny5i4OkcR0zkj5RTBwzMxsI8jYrTQH2kPQGm74hfWtBmczMrMHyFofvRcScjhlJJ9TaQdIdwIZ0dgXwbyTPkn4WeD
0iviNJwDSS0V73IXmGdFv6hbvjSZ4hfWdE3Jszp5mZ9YO8xeFY4K3ikPPN+paIuLFjRtJNwFURsUrSDyXdBuyXHu+7kk4CLpD0FZJnRZxIcpVyp6QPR8SbObOamVkf5e1z+Gv1jKQ8o7KOlTRd0lcl7QYcGhGr0nVLSApOC/BwumxxOj+a9I6otCC8igf5MzOrq7xXDn+QNA94MJ1/L/C/e9ohIi4EkHQs8ANgUNXqNpKnyo1Mp7tbVr382ZxZzcysj/IWh+OA/6qaX5v3BBHxa0n78farj2HAH4Hh6XTHsudJ+h+Gddr2+c7HldQKtAI0NTVRqVTyRspoGpwMaFcG/ZW1L7+vzdHe3l63c/VVmbJCufKWKSuUK29RWfMWh0si4k8dM5Ju7WljSSeTfHFuoaThJIXgOUmj06alw4Afkzx69ATg/wJjgXuAp4DhaWe1gMHpdm8TEbOAWQDNzc3R0tKS86VkXT9nPjOX5n6cdkNNGbOxX7KuOrOl72FyqFQq9OVvU09lygrlylumrFCuvEVlrfkuI2kbYH9J+1ctngic38NuvwIukvRu4FCShwW9CkyW9Azw24h4Op0+XtK5wAEkdyttlPRFkk7pbYGr3RltZlZfeT6CCvg88Ot0uonkjb5bEfEKcE0Xqy7vtF0A07vY/27g7hzZzMysADWLQ0S8IemciHirQ1jSVcXGMjOzRsp1K2unwrAtcFRhiczMrOHy9DnsADwAvEzSrDQEmFdwLjMza6A8zUobJP1TRPyyHoHMzKzx8n5D+hFJRwFIGi7J31g2M9uC5S0OF5DeoRQRL5HcympmZluovMXhDxHxaNX8O4sIY2ZmA0Pe4nCgpIMlbSPpeJIvrJmZ2RYq7zgMX0t/xgK/Ay4pLJGZmTVcruIQES9KOi8i1ksaFBHriw5mZmaNk6tZSdI/s2mI7u0lnVNcJDMza7S8zUorI+JHABHR7ltZzcy2bHk7pPfrmEiHzzi8mDhmZjYQ5L1yeFDSXcAzJIXhy8VFMjOzRsvbIX2HpPuB95B85+GlYmOZmVkj5Rl4byTJN6L3BVYCTxYdyszMGqvHPgdJhwG/AEYCy4DdgAWSDq5DNjMza5BaVw6fBj4YEX/pWCDpK8AXgEuLDGZmZo1T626lp6oLA0A6/8fiIpmZWaPVunIY2s3yYXkOLmkn4KGIOChtopoAtJF8b2Je+iCh6cBTwF7AtIh4XdIEYE9gOHBTRCzLcz4zM+sftYrDU5K+BnyH5GrhXcBngSdqHViSgLOBjjubrgVOj4h1khZIup2kWCyPiNmSWoHxkm4DJkbEqZKGAnOAU3rx2szMrJd6bFaKiNnAI8DPgDXpfx+LiFk5jn0asAB4XdIgYNeIWJeue5Lk+xItwMPpssXp/DhgRXr+NmCUpLxf1jMzs36Q5zGhNwM3b85BJR0KrI+IJ5MLCIYD7VWbtJHc+TQyne5uGcB6kmaslzudoxVoBWhqaqJSqWxOxLdpGgxTxmzs9f711F9Z+/L72hzt7e11O1dflSkrlCtvmbJCufIWlTXvN6Q313hgB0nHAfsAlwHvrlo/DHgeWJ1OP9Np2biqbQcBazufIL16mQXQ3NwcLS0tvQ57/Zz5zFxa1K+if00Zs7Ffsq46s6XvYXKoVCr05W9TT2XKCuXKW6asUK68RWXt9l1G0kci4vbeHDQiplUd55iIuETSQZKGpE1LewNLgYVAM/AoybMi7gEWARel+w4FnouI6E0OMzPrnZ7a8s/umJD0oeoVkgb34lyXA1dKugy4ISLaSZqr3iNpEjAa+GlErAXmSpoCXAFM7cW5zMysD3pqn1go6QiSzuGjJFU/Q/oM4Jt5ThARLel/lwBLOq3bQFIAOu8zN8+xzcysGD0Vh+8DZwGfAf4OOAlQum4UOYuDmZmVT7fFISLeAG4EbpT04Yi4s2OdpBPqkM3MzBok75Ddd0ral+S7CY9GxL3FxjIzs0bKVRwknUXStPQE8AlJCyPihk
KTmZlZw+S9YX54RHyyY0bShQXlMTOzASDvsBTra8ybmdkWJO+Vw2BJXwaWA/sDrxQXyczMGi1vh/Q3JJ0IHAncFxF3FBvLzMwaKfcgPRFxF3BXgVnMzGyA8FDYZmaWkas4SDqo6CBmZjZw5L1yuK7QFGZmNqDkLg6S3nrGgqRPFBPHzMwGgrwd0lOAPSS9QTL43i7ArQVlMjOzBstbHL4XEXM6ZjzwnpnZli3v9xzmpA/8eY3kSW2LCk1lZmYNlfdupRnAccApEbEe+FShqczMrKHydkivjohrgEfS+RHFxDEzs4Egb5/DaEmDgEj/e0hPG0saAlxFMsT3icDXgI3ABKANWBkR8yTtAEwHngL2AqZFxOuSJgB7AsOBmyJi2ea/NDMz6628xWE2cB+wEzAVuLTG9nsAiyPiJ5IeBy4DdgVOj4h1khZIup2kWCyPiNmSWoHxkm4DJkbEqZKGAnOAUzb/pZmZWW/l7ZBeBBwlaQSwJiKixvYrgZXp7J7Ab0ne8Nely54keapcC/Cv6bLFwHnAC8CK9DhtkkZJ2iYi3sz9qszMrE/yPgnuIGAGsA+wTNIVEfFkjX1GAFcDg9P/nlq1ug3YDRiZTne3DJJnRwwDXu50/FagFaCpqYlKpZLnpXSpaTBMGbOx1/vXU39l7cvva3O0t7fX7Vx9VaasUK68ZcoK5cpbVNa8zUrTga8CD5I8z+ELwNk97RARLwKTJbWk++9ctXoY8DywOp1+ptOycVXbDgLWdnH8WcAsgObm5mhpacn5UrKunzOfmUtzD1DbUFPGbOyXrKvObOl7mBwqlQp9+dvUU5myQrnylikrlCtvUVnz3q30m4j4VURsSDuHf9fTxpKOkdTRab0KOBRYk3ZUA+wNLAUWAs3psrHAPSTfoTggPc5Q4LlazVhmZta/uv0IKmlvkiYegCGSjgHeJBk+o5bVwKclPQscBVwMtANXSloL3BAR7ZJuBqZLmgSMZtPdSnMlTSG5ZXZqr16ZmZn1Wk/tE39Pcstqezq/X9W6I4Fru9sxIv5AcisrwL9XrVrSabsNwBVd7D+3h1xmZlawnorDT4BVEfFq5xXpVYSZmW2hui0OEdFTv8JOBWQxM7MBIu+trF8ETuuYJRmye4+CMpmZWYPlvSfyEGBMRLwBIOmDxUUyM7NGy1scHuooDKmniwhj9TN66oK6nGfKmI2cXXWuVTNOrst5zaxv8haHlZJuBV4iaVban2QIbzMz2wLlLQ7nk3zf4IV03gPhmZltwfIWh7si4sGOGUlzetrYzMzKLW9x2EHStcCL6XwzML6YSGZm1mh5i8N7gF9Uzb9SQBYzMxsg8haHz0XEnzpmJM0vKI+ZmQ0ANYuDpG2A/SXtX7V4IkkntZmZbYHyXDkI+Dzw63S6CciMt2RmZluOmsUhIt6QdE5EPNuxTNJVPe1jZmblluthP50Kw7Ykz2gwM7MtVJ4+hx2AB0ie4SxgCDCv4FxmZtZAeZqVNkj6p4j4ZT0CmZlZ4/XYrCTpvQAuDGZmW5daVw7/Iuk36bSAAHYFtouICwpNZmZmDVOrOPwsImZ1zEg6A5gMnNPTTpJ2B84lGcX1JOASkr6KCUAbsDIi5qX9GdOBp4C9gGkR8bqkCcCewHDgpohY1psXZ2ZmvVOrOHwPQNKOwLeAN4EPR8Rfa+x3NPBARNwh6VWSQnEEcHpErJO0QNLtJMVieUTMltQKjJd0GzAxIk6VNBSYg0eBNTOrqx77HCIiJB0J3AcsjIhJEfFXSaqx3/yIuCOdHQ48BuwaEevSZU8ChwMtwMPpssXp/DhgRXqcNmBU+i1tMzOrkx6vHCRdDHwSODMillet+gzw/VoHl/RO4FDgauBTVavagN2Akel0d8sA1gPDSG6lrT52K9AK0NTURKVSqRWnW02DkyeWlUGZskI2b1/+TkVrb28f0Pk6K1PeMmWFcuUtKmutZqUJwC9Jmns6lomk2ajH4iBpe+By4GLgr8DOVauHAc
8Dq9PpZzotG1e17SBgbefjp30hswCam5ujpaWlxkvp3vVz5jNzad4xCBtrypiNpckK2byrzmxpXJgaKpUKffl3VG9lylumrFCuvEVlrfUuc15E/L/OCztuce1O2gx0GfD1iHglXbZG0pC0aWlvYCmwkOTZEI8CY4F7gEXARek+Q4HnIiI261WZmVmf9FgcuioM6fKHahx3MvAJ4P3pFUcbyVXElZLWAjdERLukm4HpkiYBo9l0t9JcSVOAESSPJzUzszoqpH0iIq4Druti1ZJO220Aruhi/7lF5DIzs3x8F5CZmWW4OJiZWYaLg5mZZbg4mJlZhouDmZlluDiYmVmGi4OZmWW4OJiZWYaLg5mZZbg4mJlZhouDmZlluDiYmVmGi4OZmWW4OJiZWYaLg5mZZbg4mJlZhouDmZlluDiYmVlGIY8JBZC0IzAJGA9cEhEPSjoMmEDyTOmVETFP0g7AdOApYC82PUd6ArAnMBy4KSKWFZXVzMzerrDiEBGvAtdJGsumK5RrgdMjYp2kBZJuJykWyyNitqRWYLyk24CJEXGqpKHAHOCUorKamdnb1a1ZSdIgYNeIWJcuehI4HGgBHk6XLU7nxwErACKiDRglyU1gZmZ1UtiVQxeGA+1V823AbsDIdLq7ZQDrgWHAyx0L0quMVoCmpiYqlUqvgzUNhiljNvZ6/3oqU1bI5u3L36lo7e3tAzpfZ2XKW6asUK68RWWtZ3FYA+xcNT8MeB5YnU4/02nZuKptBwFrqw8WEbOAWQDNzc3R0tLS62DXz5nPzKX1/FX03pQxG0uTFbJ5V53Z0pAco6cuqLnNlDFvMPNX62put7lWzTi5348JSaHty7/7eipTVihX3qKy1q2pJiI2AGskDUkX7Q0sBRYCzemyscA9wCLgAIC0z+G5iIh6ZTUz29oVebfScODjwBhggqSdgcuBKyWtBW6IiHZJNwPTJU0CRrPpbqW5kqYAI4CpReU0M7OsIu9Wegm4If2ptqTTdhuAK7rYf25R2czMrGe+A8jMzDJcHMzMLMPFwczMMlwczMwsw8XBzMwyXBzMzCzDxcHMzDJcHMzMLMPFwczMMlwczMwsw8XBzMwyXBzMzCzDxcHMzDJcHMzMLKM8jxQzK6k8T6HrjSljNnJ2D8cu6gl0tnXwlYOZmWW4OJiZWYaLg5mZZQzoPgdJFwAC9gS+ERHPNjiSmdlWYcBeOUg6ADgiIq4DZgFXNzaRmdnWYyBfOZwALAaIiMclNTc4j5nl1PkOrVp3VvUX36HVfwZycRgJ/LFqflCjgphZOfTXbcP1Kmb94caPDCnkuIqIQg7cV5I+AwyOiG+l84siYlzV+lagNZ19D7CiD6cbAbzYh/3rqUxZoVx5y5QVypW3TFmhXHn7knWviBjZ1YqBfOVwLzAVQNJ7gEXVKyNiFklfRJ9JejgiStFsVaasUK68ZcoK5cpbpqxQrrxFZR2wxSEinpD0iKR/APYGpjc6k5nZ1mLAFgeAiLi+0RnMzLZGA/ZW1jrrl+apOilTVihX3jJlhXLlLVNWKFfeQrIO2A5pMzNrnAHdrGRmWwZJ2wJHAxsj4sFG57HaturiIGlHYBIwHrhkIP+jlbQ7cC7wEnASSd7fNzZV1yQNAa4CngBOBL4WEYsbm6o2STsBD0XEQY3O0hNJdwAb0tkVETGlkXlqkfQu4HxgdkQ80eg83ZE0EZhYtWjHiPhQo/LUkhbcacALwMHAtRHxdH8df6suDhHxKnCdpLEM/P6Xo4EHIuIOSa+SFIqpDc7UnT2AxRHxE0mPA5cBZzQ4U48kCTibpPgOdLdExI2NDpGHpG2ArwDnRUR7o/P0JCJuAW6Btz6Mnd7YRDWNAXaPiC9IOgo4Dfi3/jr4Vl0cyiQi5lfNDgceaVCUmiJiJbAynd2T5DsrA91pwALgfzU6SA5jJe0N7Ah8NSJWNzpQD04kGTzzAkmjgZkR8XhjI+VyHvCdRoeoYTlwpKSDgSOB2/rz4C
4OJSPpncChwLcanaUnkkaQDJY4mOR/tAFL0qHA+oh4MrmAGNgi4kIASccCPwBOaWyiHh0GLImIGemb2GzgfQ3O1CNJO5N8Iv9zo7P0JCJek3QJMBe4D7ihP48/0JtSrIqk7YHLgYsjYn2j8/QkIl6MiMnATQz8T2DjgeMkzQD2kTRD0qhGh6olIn4N7NfoHDW8xqZx0R4jaXIc6CYBP210iFokHQh8BDiCZASJb/bn8X3lUBJp2+1lwNcj4pVG5+mJpGOAv0TEMmAVyT/eASsipnVMSzomIgZqXw6STgZejYiFkobz9sEpB6J72TTc/i4M8PGK0g9gH6Of32gLchLweESEpFuAc/rz4Ft1cUj/5/o4ScfOBEk7R8QvGxyrO5OBTwDvT5s+2iJiYo97NM5q4NOSngWOAi5ubJwtyq+AiyS9m6R5cXKD8/QoIpZKuj99cNduwAWNzlTDeOCOiHiz0UFy+DFwmaRzgX3o52fe+EtwZmaW4T4HMzPLcHEwM7MMFwczM8twcTAzswwXBzMzy3BxsC2KpEMkTZe0b8Hn2U7S+ZJ+IumjRZ6rryR9VNLVknJ9GVHSsZIqObd9n6Sv9imgDUguDlZXko6Q9ICkz6Xz20g6T9J/Szq6H06xCjgE6PKh6f3oLODBiBgfEW+NaSPpDEkvSJok6UJJ35Z0Yq2DFVVgJA0Fjo2IqyPiH6qWj5O0QNIsSZdJ+p6kE9LV9wP/s4dj7pkOOQLwKHB8EdmtsbbqL8FZ/UXEI5IWAmdJuicdyvs/JO0REQ8ASBrUMTyIpO2S3eKNnMdfJ6keo3+OA37SxfnnSvpCRHwfQNIOwD2S1kTEI10dKB0R9nz6eeC01AHAc13kXCTpIWB5RPwo/ULofwHHAk3AQcDd3RzzeOAN4NGI+Iuk1wrIbQ3m4mCN8BrwSeBmSSdUD+Us6STgWuCI9DkA/wl8OR36+xvACpJnGZwI/DvJ2ELvBZZFxOerzvG3kt4HvB+4OiKWSWoG/h54iuTbulNJxu8/F5gJXARcmA770ZHnkjTvDsBLEfFDSX8L/B3wnKQHI+LO7l5oRGyQ9J8k4/VMlnQ2sB44EPhdRPwY+DRwtKTJJAViG+BzwO9Jvvl6GcnIppcDfwAO6/wMh/S1nkQynMa+wJeAESTfSN5VkiKip8EaDwLuS4ePuJBklM+7JX2y0+/nkvR3tloSEfGj9PwnkwyodwrwsYh4podzWRlEhH/8U9cfkjdrSN6ob6helk5XqrcFWtLps0keHATJ8y1+TvKmuR3wWNU+NwLvT6ePr9ruIWBwuvwLwIfT6ceB93aR88PAjKr5OcD+HRlJRu7s6vUt7zT/UWBBOn1GmmUX4L+72ge4C2iq+h1NAj5I8qzg7TtnTY93P7BdOv8/gGuqfmdTu/s7AN8Fvgf8H2Bkunw0cHvVdm/7/aT7Taj+e1XlvQL4dKP/jfmn7z/uc7CGiYjZQKRP4MqrY+C2F0g+yUdEbCR5g6zWMTbOgySfpEcAOwEflzSB5FkIHc0hb0bEQ12c60iSMfM7PAEcvhlZO7ybTc+3WAK0khStQd1svy/wwTTnXiQPILo3zVIBmjttP4JkyPGN6fxjwNic2e6JiHNJCtDP02awzrr7/bwlIl5IJ/9E0ixlJedmJWu0ycCdJJ/qO+wkafuIeJ3ev9F0FIv9gGXAGpLmnPkR8VdJPwbeUeMYS4ATqub3BX60WSGkdwCfAs6RNJjk0/9xJM1Ul3aRF+BZYGHHG66Sx9keSdKMdj0wT9LCiOgoXC8C75C0bSR9Mwek2TfH2jRTrd8JQOCbWbZ4Lg5WV+njDI+R1Ar8ICJelfQZ3j666DzgVkkPkryhT5S0hORNdZf0gUenAAdI2p/kk/muaX/FX0g+NX8gvfvmEOCKiHhT0qXAjZJ+R/KG+kNJH0j3PS0iftYp7h3A4ZI+S/LGWYmI5ek+fwOcIWlWRKyren1nAcPTkTIHkRSUiyNiRdrx/AxJ+/1y4C
VJp0bEz4GHJV1G0sn9j8D1klYAf06XjSLpLN5A0u+wquOcERHpa/t8OhLu/sCXJP0N0AIMk3RoRDxalbOZpGluVNq3sxfJU+Xa0r6P0Uoen7tzF7+fu0j6T14Hnk6PcQrwm6q/UVPV1YSVkEdlNTOzDF8amplZhouDmZlluDiYmVmGi4OZmWW4OJiZWYaLg5mZZbg4mJlZhouDmZll/H9V1L8ygxgkdgAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "adult_pii['DOB'].value_counts() .hist()\n", "plt.xlabel('Number of Dates of Birth')\n", "plt.ylabel('Number of Occurrences');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same thing with ZIP codes, and the results are even worse - ZIP code happens to be *very* selective in this dataset. Nearly all the ZIP codes occur only once." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYwAAAEGCAYAAAB2EqL0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAAbLUlEQVR4nO3de5xV5X3v8c9XvBRBUQQn2qqIojaKF5gYK17GxNqcWqsnbT1EY2K8kKTBS8Qq8RbRhJJ4O1VjEkzrlejJy5OX9ETbY1TGe7xgPVKjJkYxilEqRgigIvA7f6xnx51xmP3sGdbsBfN9v17zmnXd67sfcf9mrWetZysiMDMza2SDVgcwM7N1gwuGmZllccEwM7MsLhhmZpbFBcPMzLK4YJiZWZYNWx2gTCNGjIhRo0b1at9ly5YxZMiQtRtoLXCu5jhXc5yrOetrrrlz574ZESM/tCIi1tuf8ePHR2/NmTOn1/uWybma41zNca7mrK+5gCeim89UX5IyM7MsLhhmZpbFBcPMzLK4YJiZWRYXDDMzy+KCYWZmWVwwzMwsiwuGmZllWa+f9O6LeQsWc/zUO/r9uPNnHN7vxzQzy+EzDDMzy+KCYWZmWVwwzMwsiwuGmZllccEwM7MsLhhmZpbFBcPMzLK4YJiZWRYXDDMzy+KCYWZmWVwwzMwsiwuGmZllccEwM7MsLhhmZpbFBcPMzLK4YJiZWRYXDDMzy+KCYWZmWVwwzMwsiwuGmZllccEwM7MsLhhmZpbFBcPMzLK4YJiZWRYXDDMzy+KCYWZmWVwwzMwsy4ZlvKikjwAnA28BhwFnAEOAicAS4IWIuE3SxsA04GVgB+CCiHhf0kRge2A4cFNEPCNpO+AUYAHwfkRcU0Z2MzPrXikFA/g48GhE3CVpOUXx2Bv4m4hYJukOSf9OUUCei4gbJE0CjpZ0J3BMRPy1pM2BWcARwHTg/IiYL+lGSXdGxPyS8puZWRelXJKKiNkRcVeaHQ48C2wZEcvSspeAvYAO4Im07Mk0Px54Pr3OEmBbSRsAe9QViKeBCWVkNzOz7pV1hgGApK2APYALgc/VrVoCbA2MTNNrWgbwHjAM2KSb/bs75iRgEkBbWxudnZ29yt42GKaMXdmrffuiUd6lS5f2+j2Vybma41zNca7mlJWrtIIhaSPgbOB04B1gs7rVw4DXgYVp+pUuy8bXbbsJ8Dbwbpf9X+3uuBExE5gJ0N7eHh0dHb3Kf9Ws2Vw2r9R62q35x3b0uL6zs5PevqcyOVdznKs5ztWcsnKVckkqXUL6B+DyiFgcESuARZKGpE12BOYBc4D2tGwccB8wF9glvc7mwGsREcA8SaPStnsCD5WR3czMulfWn9CTgaOA/SVBcQnpbOAcSW8D10XEUkk3A9MknQSM4o
XDzMyyuGCYmVkWFwwzM8vigmFmZllcMMzMLMv/BwtHyPQJoNpPAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "adult_pii['Zip'].value_counts().hist()\n", "plt.xlabel('Number of Occurrences')\n", "plt.ylabel('Number of ZIP Codes');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How Many People Can We Re-Identify?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataset, how many people can we re-identify uniquely? We can use our auxiliary information to find out! First, let's see what happens with just dates of birth. We want to know how many *possible identities* are returned for each data record in the dataset. The following histogram shows the number of records with each number of possible identities. The results show that we can uniquely identify almost 7,000 of the data records (out of about 32,000), and an additional 10,000 data records are narrowed down to two possible identities." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAX8AAAD4CAYAAAAEhuazAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAASZElEQVR4nO3da4wd5X3H8e8fCAhs2MQxLAkBDIpIglgC9iakuVTrEJFIFBKFynJBqmhjliTCTcTWwqDEwUqVEBLSFtJUtZRSwsWm5Q0vTBGN4oU0zQVMI29puFWYgCl2wLHdNVeLf1+ccdmu1vbunF2Px8/3IyHOmZnnzG9Wx78z5zm3yEwkSWU5pOkAkqT9z/KXpAJZ/pJUIMtfkgpk+UtSgSx/SSrQYU0HmIy5c+fmvHnzao3duXMns2bNmt5AM6hNeduUFdqVt01ZoV1525QVusu7fv36FzLz2AlXZuYB/9+CBQuyrnXr1tUe24Q25W1T1sx25W1T1sx25W1T1szu8gIP5R561WkfSSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoFa8Qnftpq3fO2Uxwz17eLSGuPG2njd+V2Nl3Tw88xfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoEsf0kqkOUvSQWy/CWpQJa/JBVon1/pHBFHAUuARcCVmfnLiDgTWAzsAJ7MzLsi4nBgJfA0cDKwIjNfj4jFwEnAHODWzHwkIk4ElgKbgNcz8/szcXCSpInts/wz8yXgxoiYz5vPFK4HLsrMnRGxNiLupfNg8Ghm3hIRg8CiiLgHuDgzL4yIY4DbgQuAbwBfzcyNEfHDiLgnMzfOwPFJkiYw5WmfiDgCeFtm7qwWPQW8HxgAHqqWPVxdXwA8BpCZO4B3RsQhwBljyn4D8JF68SVJddT5Ja85wOiY6zuA44Bjq8t7WgbwKtADHDHB+P+nevYwCNDb28vw8HCNqDA6Olp7bLeG+nZNeUzvkfXGjbW/jrfJv20dbcrbpqzQrrxtygozl7dO+b8IHD3meg/wPLCluvzMuGULxmx7BLANeGXc+GfH7yQzVwGrAPr7+3NgYKBG1E4R1h3brTo/xzjUt4sbRrr7dc2Nlwx0NX6ymvzb1tGmvG3KCu3K26asMHN5pzztk5mvAS9GxKxq0SnACLAO6K+WzQfuB9YDpwFUc/7PZWYCIxExr9r2TOCndQ9AkjR1k3m3zxzg00AfsDgijgauAq6JiG3AzZk5GhG3ASsjYgkwjzff7bM6IoaAucDy6ma/AlwREc8AP8/M30z3gUmS9mwy7/bZCtxc/TfWhnHbvQZcPcH41RMse4bOA4gkqQF+yEuSCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoEsf0kqkOUvSQWy/CWpQJa/JBXI8pekAln+klQgy1+SCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQV6LA6gyLiUGAFsBk4HbgeeCuwGNgBPJmZd0XE4cBK4GngZGBFZr4eEYuBk4A5wK2Z+Ui3ByJJmrxa5Q/0Acdn5tci4oPAZ4FPARdl5s6IWBsR99J5MHg0M2+JiEFgUUTcA1ycmRdGxDHA7cAF03AskqRJqjvt8yhwdkScDpwN/DPwtszcWa1/Cng/MAA8VC17uLq+AHgMIDN3AO+MCKefJGk/isysNzDio8DfAD8Bvg38fWaeW637BvAg8HlgSWY+ExGnVdutAd6dmV+vtv034PzM/N242x8EBgF6e3sXrFmzplbO0dFRZs+eXWtst0Y2bZ/ymN4jYfPLMxBmBkyUte+EnmbCTEKT94WpalNWaFfeNmWF7vIuXLhwfWb2T7Su7pz/e+lM85wFXAosB44es0kP8Dywpbr8zLhlC8ZsewSwbfw+MnMVsAq
gv78/BwYG6kRleHiYumO7denytVMeM9S3ixtG6s7G7V8TZd14yUAzYSahyfvCVLUpK7Qrb5uywszlrTvdch7weHaeNtwBnAG8GBGzqvWnACPAOmD3o8584H5gPXAaQDXn/1zWffohSaql7inmncCyiLgMOBW4FvgtcE1EbANuzszRiLgNWBkRS4B5vPlun9URMQTMpfOsYcaMbNpe6wxckg5mtco/MzcDfz7Bqg3jtnsNuHqC8avr7FeSND18l40kFcjyl6QCWf6SVCDLX5IKZPlLUoEsf0kqkOUvSQWy/CWpQJa/JBXI8pekAln+klQgy1+SCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoEO62ZwRBwKnAPsysxfTk8kSdJMq13+EfEu4PPALZn5REScCCwFNgGvZ+b3IyKAFcAW4FTg65m5IyI+DnwMOBT4UWY+0O2BSJImr1b5R8QhwLeAyzNztFr8DeCrmbkxIn4YEfcA7wbIzL+NiPOApRHxLeArwLlAAD+KiE9k5hvdHowkaXLqzvnvLu6lEfF3EXEacEZmbqzWbwA+AgwAD1XLHq6uzwNezI43gJeAd9TMIUmqITJz6oMihoC3ZOZ1EXE68AOgJzNPr9YPArOA9wK3ZeZPIuJw4EE6U0VLMvNz1bZ3AN/OzH8ft49BYBCgt7d3wZo1a2od4Jat29n8cq2hjeg9ktbknShr3wk9zYSZhNHRUWbPnt10jElpU1ZoV942ZYXu8i5cuHB9ZvZPtK7unP8rwO40v6Zz5r51zPoe4FlgTnV597Ln6cz/94zb9vnxO8jMVcAqgP7+/hwYGKgV9Kbb7+aGka5e196vhvp2tSbvRFk3XjLQTJhJGB4epu79aH9rU1ZoV942ZYWZy1t32ucB4Mzq8luBF4CRiJhXLTsT+CmwDtj9qDMfuB94GpgTHYcARwKba+aQJNVQ6xQzM0ci4mcRsRQ4js67fJ4FroiIZ4CfZ+Zvqssfi4jLgNPovNtnV0T8BZ0XfQ8FrvXFXknav2rPL2TmdyZYfNW4bRJYOcHYHwM/rrtvSVJ3/ISvJBXI8pekAln+klQgy1+SCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoEsf0kqkOUvSQWy/CWpQJa/JBXI8pekAln+klQgy1+SCmT5S1KBLH9JKtBh3QyOiNnAg5n5vog4E1gM7ACezMy7IuJwYCXwNHAysCIzX4+IxcBJwBzg1sx8pKujkCRNSe3yj4gALgW2VouuBy7KzJ0RsTYi7qXzYPBoZt4SEYPAooi4B7g4My+MiGOA24ELujoKSdKURGbWGxhxEfAwcDPwSeCBzDynWvc9YDXwBeCbmflIRPQDlwN3Ap/MzGXVtuuBD2TmG+NufxAYBOjt7V2wZs2aWjm3bN3O5pdrDW1E75G0Ju9EWftO6GkmzCSMjo4ye/bspmNMSpuyQrvytikrdJd34cKF6zOzf6J1tc78I+IM4NXMfKrzBIA5wOiYTXYAxwHHVpf3tAzgVaAH+N3YfWTmKmAVQH9/fw4MDNSJyk23380NI13Nbu1XQ327WpN3oqwbLxloJswkDA8PU/d+tL+1KSu0K2+bssLM5a3bMouAwyPio8CpwDLgxDHre4DngS3V5WfGLVswZtsjgG01c0gAzFu+dp/bDPXt4tJJbDcVG687f1pvT9pfapV/Zq7YfTkiPpSZV0bE+yJiVmbuBE4BRoB1QD/wH8B84H5gPfClauwxwHNZd+5JklTLdM4vXAVcExHbgJszczQibgNWRsQSYB5vvttndUQMAXOB5dOYQZI0CV2Xf2YOVP/fAGwYt+414OoJxqzudr+SpPr8kJckFcjyl6Q
CWf6SVCDLX5IKZPlLUoEsf0kqkOUvSQWy/CWpQJa/JBXI8pekAln+klQgy1+SCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcjyl6QCWf6SVCDLX5IKZPlLUoEsf0kq0GF1BkXE8cBlwFbgPOBKYBawGNgBPJmZd0XE4cBK4GngZGBFZr4eEYuBk4A5wK2Z+UjXRyJJmrRa5Q+cA/wiM++LiJfoPBCcBVyUmTsjYm1E3EvnweDRzLwlIgaBRRFxD3BxZl4YEccAtwMXdH8okqTJiszs7gYihoAXgC9m5jnVsu8Bq4EvAN/MzEcioh+4HLgT+GRmLqu2XQ98IDPfGHe7g8AgQG9v74I1a9bUyrdl63Y2v1xraCN6j6Q1eSfK2ndCTyNZRjZt3+c2M/G3nanjHR0dZfbs2TNy2zOhTXnblBW6y7tw4cL1mdk/0bq6Z/4ARMTbgTOAa4E/HrNqB3AccGx1eU/LAF4FeoDfjb3tzFwFrALo7+/PgYGBWhlvuv1ubhjp6jD3q6G+Xa3JO1HWjZcMNJLl0uVr97nNTPxtZ+p4h4eHqXufb0Kb8rYpK8xc3tov+EbEW4CrgC8D/w0cPWZ1D/A8sKW6vKdlAEcA2+rmkCRNXa3yj4hDgGXAdzNze2a+BrwYEbOqTU4BRoB1wO6nHPOB+4H1wGnV7RwDPJfdzj1Jkqak7nPgK4DPAB+OCOhM41wFXBMR24CbM3M0Im4DVkbEEmAeb77bZ3X1WsFcYHl3hyBJmqpa5Z+ZNwI3TrBqw7jtXgOunmD86jr7lSRNDz/kJUkFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgdrxJTLSAWreJL5TqI6hvl37/L6ijdedPyP7Vhk885ekAln+klQgy1+SCmT5S1KBLH9JKpDlL0kFsvwlqUCWvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalAlr8kFcgfc5E0JRP9gM1kfnxmOvgDNtPHM39JKpDlL0kFsvwlqUCWvyQVqLEXfCNiKRDAScBfZuamprJIUmkaKf+IOA04KzM/V12+FrisiSyS2mOidxpNVZ13Jh2M7zJqatrn94GHATLzcaC/oRySVKTIzP2/04irgWcz89bq+n9m5unjthkEBqur7wEeq7m7ucALdbM2oE1525QV2pW3TVmhXXnblBW6y3tyZh470Yqm5vy3AD1jrr88foPMXAWs6nZHEfFQZrbmmUWb8rYpK7Qrb5uyQrvytikrzFzepqZ9HgDOBoiI9wDrG8ohSUVq5Mw/M5+IiF9FxBeBU4CVTeSQpFI19lbPzLxpP+2q66mj/axNeduUFdqVt01ZoV1525QVZihvIy/4SpKa5Sd8JalAfqWzpK5ExKHAOcCuzPxl03k0OQdt+UfEUcASYBFw5YF+p4yI4+l8ynkrcB6dzP/VbKqJRcQs4KvAE8C5wHcy8+FmU+1dRMwGHszM9zWdZV8i4j7gterqY5k51GSevYmIdwGfB27JzCeazrM3EXExcPGYRUdl5sebyrM31QPqCmAzcDpwfWb+Zjr3cdCWf2a+BNwYEfNpx/TWOcAvMvO+iHiJzgPB8oYz7ck7gIcz8x8j4nFgGfBHDWfao4gI4FI6D6xtcEdm/kPTIfYlIg4BvgVcnpmjTefZl8y8A7gD/u9k66JmE+1VH3B8Zn4tIj4IfBb4q+ncwUFb/m2TmXePuToH+FVDUfYpM58EnqyunkTncxsHss8Ca4E/bDrIJM2PiFOAo4BvZ+aWpgPtwbl0vpxxaUTMA26ovq6lDS4Hvt90iL14FDg7Ik6n85moe6Z7B5b/ASYi3g6cAXyv6Sx7ExFz6Xwh35F0/iEdkCLiDODVzHyq8wTgwJeZfwYQER8BfgBc0GyiPToT2JCZ11UldQvwew1n2qeIOJrOWfVvm86yJ5n5SkRcCawGfgLcPN3
7aMN0SDEi4i3AVcCXM/PVpvPsTWa+kJlXALdyYJ9BLQI+GhHXAadGxHUR8c6mQ01GZv4UeHfTOfbiFeCI6vKv6UwHtsES4J+aDrE3EfFe4FPAWXS+AeGvp3sfnvkfIKr502XAdzNze9N59iYiPgT8T2Y+Amykcwc9IGXmit2XI+JDmXmgvo4CQEScD7yUmesiYg7wbNOZ9uIBOs/+AN5KC74srTrB+gNmoEyn2XnA45mZEXEH8KfTvYODtvyrfzifpvPCyeKIODoz/6XhWHtzBfAZ4MPV9MSOzLx4ryOaswX4k4jYBHwQ+HKzcQ4q/wp8KSJOpDP9d0XDefYoM0ci4mfVDzMdByxtOtMkLALuy8w3mg6yD3cCyyLiMuBU3nyQnTZ+wleSCuScvyQVyPKXpAJZ/pJUIMtfkgpk+UtSgSx/SSqQ5S9JBbL8JalA/wtfJU8z2ZF7QgAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "attack = pd.merge(adult_pii, adult_data, left_on=['DOB'], right_on=['DOB'])\n", "attack['Name'].value_counts().hist();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it's not possible to re-identify a majority of individuals using *just* date of birth. What if we collect more information, to narrow things down further? If we use both date of birth and ZIP, we're able to do much better. In fact, we're able to uniquely re-identify basically the whole dataset." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD4CAYAAAAEhuazAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy86wFpkAAAACXBIWXMAAAsTAAALEwEAmpwYAAATHklEQVR4nO3df4xl5X3f8fdnAdNljdesF8Z2ZVjcaKPUgONlGkemtOMfpVIS/5DsImTUqmlhXbeQIH4EvAHEGouUpJAUXFNtlFKDt1iu1XQlQ12K4IKDi1NAFivLkGB7MeAC9WIYDWt+iW//uGfDZTzruXvnzlwvz/slrbjnOc+55/sddj/3zHPPnUlVIUlqy6pJFyBJWnmGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgw6edAHDWL9+fW3YsGGkY5977jnWrFkz3oJ+wdlzG+y5DUvp+b777vtxVR250L4DIvw3bNjAvffeO9KxvV6PmZmZ8Rb0C86e22DPbVhKz0ke2dc+l30kqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTogPuS1FDsff5Z/ftHNEzn3rn/7mxM5ryQtxit/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBi36Uz2TrAEuAf4a+CDw74CXgdOAWeDhqvpqkjcAW4FHgGOAS6vqpSSnAUcD64Abq+o7Sd4BnA08DrxUVV8Yf2uSpH0Z5kc6vw24v6q+kuSvgAuAI4CPV9VzSW5O8nX6LwYPVtUXk2wGTk1yC/DJqvpIkjcB24EPA1cAl1TVriQ3JLmlqnYtR4OSpJ+1aPhX1cPAw93m0cA99AP9uW7sB8C7gRngD7qx+4FPAU8CD3XPM5vk7UlWAccNhP0DwEnA3m0AuheQzQBTU1P0er39bg5gajWcd/zLIx27VKPWvFRzc3MTO/ek2HMb7Hl8hvplLknWA5cBq7v/fmRg9yxwFHBk93hfYwAvAGuBQxc4/jWqahuwDWB6erpmZmaGKfVnXLt9B1ftnMzvrNl1+sxEztvr9Rj163Wgsuc22PP4DPWGb1X9uKrOAm6kv65/+MDutcATwFPd432NQT/0nwGeX+B4SdIKWTT8k/x6knd1m7uA44Dd3RvBAMcCO4E7gOlubBNwJ3AfsLF7njcBP6qqAnYm2dDNPQG4e8mdSJKGNsx6yFPAbyd5HPg14BxgDtiS5Bng+qqaS/IlYGuSM
4ANvHq3z01JzgPWAxd1z3kxcFaSR4F7quqH42xKkvTzDfOG7/fp3+oJ8B8Hdj0wb96LwGcWOP6mBcYeBS7cr0olSWPjh7wkqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JatDBi01I8lbgTOBp4BTgXOATwMkD0z4NPAZcCjwFvBO4vKpmk3ygm3sQcFtV3ZXkzcDvA98H1lfV5WPrSJK0qEXDH3gv8K2qujXJHvovBA9W1W8NTkryIYCqui7JKcDZSa4ELgY+CAS4rZt3PvC1qrozyRVJTqqqu8fYlyTp51h02aeqdlTVrd3mOuDbwCFJtiT5fHdlDzAD3Ns9vr/b3gDsrr5XgD3A2/YxV5K0Qoa58gcgyVuA44DPV9UL3dhhwO1JPgEcCcx202eBo+aNDY6vo/9CMDg2/3ybgc0AU1NT9Hq9oZsaNLUazjv+5ZGOXapRa16qubm5iZ17Uuy5DfY8PkOFf5JDgAuBc/YGP0BV7UnyDeA99Nf613a71gJPzBsbHN8NrAHmBsZeo6q2AdsApqena2ZmZn/6+hvXbt/BVTuHfo0bq12nz0zkvL1ej1G/Xgcqe26DPY/Poss+SVYBFwBXV9WzSVYn+ezAlGOA7wJ3ANPd2CbgTuARYF36VgGrgSf3MVeStEKGuSQ+C/gY8L4k0F+m+T/dUs+xwC1V9XCS7wEnJzkT2Ej/bp+Xk3yO/pu+BwGXVdUrSa4GtiTZCOypqm+OvTNJ0j4tGv5VdQ1wzRDzCti6wPjtwO3zxp4Bfm/oKiVJY+WHvCSpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ06eLEJSd4KnAk8DZwCnAusAU4DZoGHq+qrSd4AbAUeAY4BLq2ql5KcBhwNrANurKrvJHkHcDbwOPBSVX1h/K1JkvZl0fAH3gt8q6puTbKH/gvBrwIfr6rnktyc5Ov0XwwerKovJtkMnJrkFuCTVfWRJG8CtgMfBq4ALqmqXUluSHJLVe1ahv4kSQtYdNmnqnZU1a3d5jrgu8ARVfVcN/YD4N3ADHBvN3Z/t30i8FD3PLPA25OsAo4bCPsHgJOW2ogkaXjDXPkDkOQtwHHAZcA/G9g1CxwFHNk93tcYwAvAWuDQBY6ff77NwGaAqakper3esKW+xtRqOO/4l0c6dqlGrXmp5ubmJnbuSbHnNtjz+AwV/kkOAS4EzgF+Chw+sHst8ATwVPf40XljJw7MPRR4Bnh+3vGPzT9nVW0DtgFMT0/XzMzMMKX+jGu37+CqnUO/xo3VrtNnJnLeXq/HqF+vA5U9t8Gex2fRZZ9umeYC4OqqeraqXgR2J1nTTTkW2AncAUx3Y5uAO4H7gI3d87wJ+FFVFbAzyYZu7gnA3eNpR5I0jGEuic8CPga8Lwn0l2kuBLYkeQa4vqrmknwJ2JrkDGADr97tc1OS84D1wEXdc14MnJXkUeCeqvrhGHuSJC1i0fCvqmuAaxbY9cC8eS8Cn1ng+JsWGHuU/guIJGkC/JCXJDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQQcvNiHJYcAZwKnAuVX1l0kuBE4emPZp4DHgUuAp4J3A5VU1m+QD3dyDgNuq6q4kbwZ+H/g+sL6qLh9jT5KkRSwa/lW1B7gmySZe/U7hyar6rcF5ST7Uzb8uySnA2UmuBC4GPggEu
K2bdz7wtaq6M8kVSU6qqrvH15Yk6ecZddnnkCRbkny+u7IHmAHu7R7f321vAHZX3yvAHuBt+5grSVohi175L6Sq/hT+Zkno9iSfAI4EZrsps8BR88YGx9fRfyEYHHuNJJuBzQBTU1P0er1RSmVqNZx3/MsjHbtUo9a8VHNzcxM796TYcxvseXxGCv+9qmpPkm8A76G/1r+227UWeGLe2OD4bmANMDcwNv+5twHbAKanp2tmZmakGq/dvoOrdi6pzZHtOn1mIuft9XqM+vU6UNlzG+x5fPZ72SfJ6iSfHRg6BvgucAcw3Y1tAu4EHgHWpW8VsBp4ch9zJUkrZJi7fdYBHwWOB04DDgd+0i31HAvcUlUPJ/kecHKSM4GN9O/2eTnJ5+i/6XsQcFlVvZLkamBLko3Anqr65rJ0J0la0DB3+zwNXN/92et/LTCvgK0LjN8O3D5v7Bng9/azVknSmPghL0lqkOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMOXmxCksOAM4BTgXOr6i+TnACcBswCD1fVV5O8AdgKPAIcA1xaVS8lOQ04GlgH3FhV30nyDuBs4HHgpar6wnI0J0la2KLhX1V7gGuSbOLV7xT+EPh4VT2X5OYkX6f/YvBgVX0xyWbg1CS3AJ+sqo8keROwHfgwcAVwSVXtSnJDkluqatcy9CdJWsB+L/skORQ4oqqe64Z+ALwbmAHu7cbu77ZPBB4CqKpZ4O1JVgHHDYT9A8BJo5UvSRrFolf+C1gHzA1szwJHAUd2j/c1BvACsBY4dIHjX6P77mEzwNTUFL1eb4RSYWo1nHf8yyMdu1Sj1rxUc3NzEzv3pNhzG+x5fEYJ/93A4QPba4EngKe6x4/OGztxYO6hwDPA8/OOf2z+SapqG7ANYHp6umZmZkYoFa7dvoOrdo7S5tLtOn1mIuft9XqM+vU6UNlzG+x5fPZ72aeqXgR2J1nTDR0L7ATuAKa7sU3AncB9wEaAbs3/R1VVwM4kG7q5JwB3j9qAJGn/DXO3zzrgo8DxwGlJDgcuBLYkeQa4vqrmknwJ2JrkDGADr97tc1OS84D1wEXd014MnJXkUeCeqvrhuBuTJO3bMHf7PA1c3/0Z9MC8eS8Cn1ng+JsWGHuU/guIJGkC/JCXJDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ0y/CWpQQePemCSW4EXu82HgD8BzgYeB16qqi8kCXAp8BTwTuDyqppN8gHgZOAg4Laqumv0FiRJ+2vk8Af+S1X9570bSW4ELqmqXUluSHIL8EsAVXVdklOAs5NcCVwMfBAIcFuSD1XVK0uoRZK0H1JVox2YXAP8BDgM+CPgf1bVe7p95wP/F/gV4H9X1c1J1gM3AZ8G/qCq/kk392vAp6rq8XnPvxnYDDA1NXXil7/85ZHqfOrpZ3nypyMdumTH/+21Eznv3Nwcb3zjGydy7kmx5zbY8/55//vff19VTS+0b+Qr/6r6HYAkJwF/Bhw6sHsWOAo4snu8r7HB8deEf1VtA7YBTE9P18zMzEh1Xrt9B1ftXMo3OKPbdfrMRM7b6/UY9et1oLLnNtjz+Cz5Dd+qupv+8s7zA8NrgSfor/Wv/Tljg+OSpBUyUvgn+c0k7+8erwMeA3Ym2dBNOQG4G7gD2PstxybgTuARYF36VgGrgSdH7kCStN9GXQ/5C+B3k7wDOA44C9gDnJXkUeCeqvph9/jkJGcCG+nf7fNyks/Rf9P3IOAy3+yVpJU1UvhX1bPAZxfYdeG8eQVsXeD424HbRzm3J
Gnp/JCXJDXI8JekBhn+ktQgw1+SGmT4S1KDDH9JapDhL0kNMvwlqUGGvyQ1yPCXpAYZ/pLUIMNfkhpk+EtSgwx/SWqQ4S9JDTL8JalBhr8kNcjwl6QGGf6S1CDDX5IaZPhLUoMMf0lqkOEvSQ06eFInTnI2EOBo4I+r6vFJ1SJJrZnIlX+SjcCvVtU1wDbgsknUIUmtmtSyzz8A7geoqr8CpidUhyQ1aVLLPkcCjw1sHzp/QpLNwOZucy7JQyOeaz3w4xGPXZJcOYmzAhPseYLsuQ32vH+O2deOSYX/U8Dage2fzp9QVdvoLwktSZJ7q6qp7yzsuQ323Ibl6nlSyz53Ae8BSPLLwH0TqkOSmjSRK/+q+usk307yr4Fjga2TqEOSWjWxWz2r6toVOtWSl44OQPbcBntuw7L0nKpajueVJP0C8xO+ktQgw196nUhyRJLfSPLWSdeiX3yvm/BPcliS30nyF0l+bYH9b07yR0k+neSSSdQ4bkP0/LEkv931fV2STKLOcVqs54F5/yjJf1rJ2pbLMD0n+TDwT4FeVT2xshWO3xB/t38lyWeTnNv994D+u53krUkuSfJvkuxI8nfm7R97fr1uwr+q9nQ/LuJhFu7rfOBrVXUdsDrJSSta4DIYoue/D3y5qi4A/i7wyytZ33IYomeSvB3YtK/9B5rFek5yAvDeqrqmqvaseIHLYIj/z58C/kdVXU3/Q1BrF5hzIHkv8K2q+g/AfwfOnLd/7Pn1uvjHMaQZ4N7u8f3d9utaVZ1fVXs/QLcK2DXBclZEkoOAU4GvTLqWFXQhsDvJFd1V8CGTLmgF3AX8qyTrgf9XVc9MuJ4lqaodVXVrt7kO+Pa8KTOMOb9aCv91wN6rolngqAnWsqKSfAK4oaqen3QtK+BfAjcCLd3G9i7gK1W1BfhbwL+YcD3Lrqr+G/0fefAN4M4JlzM2Sd4CHAf8+bxdY8+vlsJ/N7Cme7wWOODXRYeR5F3A0VX1p5OuZbklOQL4h8AF9K+GT0zyu5OtakW8xKuf2ekB755cKSsjyRZgO/3lknOS/PqES1qy7ju2C4FzquqFebvHnl+v6/BPcmSSva+Qd/DqTw/dxOvoamHQYM9JNgD/GPjjiRa1zPb2XFU/qarTq+oi4Ergvqr695OubznM+7t9F68G/i8BD06mquU1r+ffAL5bVbPAfwX+3uQqW7okq+hftFxdVc92Y8uaXxP7hO+4JVkHfBQ4HjgtyeH0f3T0QcAW4GpgS/e7BPZU1TcnVuyYDNHzTcDTwAe6myH+vKr+bELljsUQPb/uDNHzFcAlXVAcDVw6qVrHZYie/xA4L8luYCMH/u8EOQv4GPC+7t/qLPA9ljG//ISvJDXodb3sI0lamOEvSQ0y/CWpQYa/JDXI8JekBhn+ktQgw1+SGmT4S1KD/j9KBE+PRAmJgAAAAABJRU5ErkJggg==", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "attack = pd.merge(adult_pii, adult_data, left_on=['DOB', 'Zip'], right_on=['DOB', 'Zip'])\n", "attack['Name'].value_counts().hist();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we use both pieces of information, we can re-identify **essentially everyone**. This is a surprising result, since we generally assume that many people share the same birthday, and many people live in the same ZIP code. It turns out that the *combination* of these factors is **extremely** selective. According to Latanya Sweeney's work {cite}`identifiability`, 87% of people in the US can be uniquely re-identified by the combination of date of birth, gender, and ZIP code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's just check that we've actually re-identified *everyone*, by printing out the number of possible data records for each identity:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/plain": [ "Antonin Chittem 2\n", "Barnabe Haime 2\n", "Gwenny Penley 1\n", "Marvin Daubney 1\n", "Ursola Walasik 1\n", "Name: Name, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "attack['Name'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like we missed two people! In other words, in this dataset, only **two people** share a combination of ZIP code and date of birth." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to prevent the release of private information is to release only *aggregate* data." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "41.772181444058845" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult['Age'].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Problem of Small Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In many cases, aggregate statistics are broken down into smaller groups. For example, we might want to know the average age of people with a particular education level." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
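<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Education</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>10th</td>\n",
" <td>42.032154</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>11th</td>\n",
" <td>42.057021</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12th</td>\n",
" <td>41.879908</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>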
" ], "text/plain": [ " Education Age\n", "0 10th 42.032154\n", "1 11th 42.057021\n", "2 12th 41.879908" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult[['Education', 'Age']].groupby('Education', as_index=False).mean().head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregation is supposed to improve privacy because it's hard to identify the contribution of a particular individual to the aggregate statistic. But what if we aggregate over a group with just *one person* in it? In that case, the aggregate statistic reveals one person's age *exactly*, and provides no privacy protection at all! In our dataset, most individuals have a unique ZIP code - so if we compute the average age by ZIP code, then most of the \"averages\" actually reveal an individual's exact age." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
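<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Zip</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>4</td>\n",
" <td>72.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12</td>\n",
" <td>46.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>16</td>\n",
" <td>38.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>17</td>\n",
" <td>31.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>18</td>\n",
" <td>40.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>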
" ], "text/plain": [ " Zip Age\n", "0 4 72.0\n", "1 12 46.0\n", "2 16 38.0\n", "3 17 31.0\n", "4 18 40.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult[['Zip', 'Age']].groupby('Zip', as_index=False).mean().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The US Census Bureau, for example, releases aggregate statistics at the [*block level*](https://www.census.gov/newsroom/blogs/random-samplings/2011/07/what-are-census-blocks.html). Some census blocks have large populations, but some have a population of zero! The situation above, where small groups prevent aggregation from hiding information about individuals, turns out to be quite common.\n", "\n", "How big a group is \"big enough\" for aggregate statistics to help? It's hard to say - it depends on the data and on the attack - so it's challenging to build confidence that aggregate statistics are really privacy-preserving. However, even very large groups do not make aggregation completely robust against attacks, as we will see next." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Differencing Attacks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problems with aggregation get even worse when you release multiple aggregate statistics over the same data. 
For example, consider the following two summation queries over large groups in our dataset (the first over the whole dataset, and the second over all records except one):" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1360144" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult['Age'].sum()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1360088" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we know both answers, we can simply take the difference and determine Karrie's age exactly! This kind of attack can proceed even if the aggregate statistics are over *very large groups*." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "56" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adult['Age'].sum() - adult[adult['Name'] != 'Karrie Trusslove']['Age'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These attacks illustrate two recurring themes:\n", "\n", "- Releasing *data* that is useful makes ensuring *privacy* very difficult\n", "- Distinguishing between *malicious* and *non-malicious* queries is not possible" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Summary\n", "- A *linkage attack* involves combining *auxiliary data* with *de-identified data* to *re-identify* individuals.\n", "- In the simplest case, a linkage attack can be performed via a *join* of two tables containing these datasets.\n", "- Simple linkage attacks are surprisingly effective:\n", " - Just a single data point is sufficient to narrow things down to a few records\n", " - The narrowed-down set of records 
helps suggest additional auxiliary data which might be helpful\n", " - Two data points are often good enough to re-identify a huge fraction of the population in a particular dataset\n", " - Three data points (gender, ZIP code, date of birth) uniquely identify 87% of people in the US\n", "``` " ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 2 }