Introduction
This book introduces the challenges of data privacy and the techniques that have been developed to address them, and helps you understand how to implement some of those techniques.
Structure & Design Philosophy
The book presents numerous examples as programs, including implementations of many of the concepts it covers. Each chapter is generated from a self-contained Jupyter Notebook.
We assume a working knowledge of Python, as well as basic knowledge of the Pandas and NumPy libraries. You will also benefit from some background in discrete mathematics and probability.
This book is open source, and the latest version will always be available online at https://programming-dp.com/. The source code is available on GitHub. If you would like to fix a typo, suggest an improvement, or report a bug, you can open an issue on GitHub.
Note
This book is multi-modal: there is a print version and a more interactive web/HTML version, hosted at https://programming-dp.com/.
Both versions are compiled from the same source code and manuscript, so if you are reading the print version, you can ignore any notes or comments that reference the interactive UI (like the following two notes).
Note
(web) Click the “download” button at the top right of the chapter and select “.ipynb” to download the notebook for that chapter; you’ll then be able to execute the examples yourself.
Note
(web) Many of the examples are generated by code that is hidden (for readability) in the chapters you’ll see here. You can show this code by clicking the “Click to show” labels adjacent to these cells.
Privacy
The techniques described in this book have developed out of the study of data privacy. For our purposes, we will define data privacy this way:
Definition 1 (Data Privacy)
Data privacy techniques have the goal of allowing analysts to learn about trends in sensitive data, without revealing information specific to individuals.
This is a broad definition, and many different techniques fall under it. But it’s important to note what this definition excludes: techniques for ensuring security, like encryption. Encrypted data doesn’t reveal anything, so it fails to meet the first requirement of our definition (allowing analysts to learn about trends). The distinction between security and privacy is an important one: privacy techniques involve an intentional release of information, and attempt to control what can be learned from that release; security techniques usually prevent the release of information, and control who can access data. This book covers privacy techniques, and we will only discuss security when it has important implications for privacy.
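To make the two halves of the definition concrete, here is a minimal sketch in plain Python. The records, names, and the helper functions `average_age` and `lookup_age` are invented for illustration and are not taken from the book. The sketch only contrasts a trend-level release with an individual-level one; it does not by itself provide any privacy protection, and releasing an exact aggregate should not be assumed to be safe.

```python
from statistics import mean

# Hypothetical records, made up purely for this illustration.
records = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": 45},
    {"name": "Carol", "age": 29},
    {"name": "Dave", "age": 52},
]


def average_age(rows):
    """Trend-level release: one statistic that summarizes everyone."""
    return mean(row["age"] for row in rows)


def lookup_age(rows, name):
    """Individual-level release: information specific to one person,
    which is what data privacy techniques aim to avoid revealing."""
    return next(row["age"] for row in rows if row["name"] == name)


trend = average_age(records)           # describes the group as a whole
individual = lookup_age(records, "Bob")  # describes exactly one person

print("Average age (trend):", trend)
print("Bob's age (individual):", individual)
```

The first output is the kind of information the definition wants analysts to be able to learn; the second is the kind it wants to keep from being revealed.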
This book is primarily focused on differential privacy. The first couple of chapters outline some of the reasons why: differential privacy (and its variants) is the only formal approach we know about that seems to provide robust privacy protection. Approaches that have been in common use for decades (like de-identification and aggregation) have more recently been shown to break down under sophisticated privacy attacks, and even more modern techniques (like