Design and Deployment#
Exercises in Algorithm Design#
Issues to Consider
How many queries are required, and what kind of composition can we use?
Is parallel composition possible?
Should we use sequential composition, advanced composition, or a variant of differential privacy?
Can we use the sparse vector technique?
Can we use the exponential mechanism?
How should we distribute the privacy budget?
If there are unbounded sensitivities, how can we bound them?
Would synthetic data help?
Would post-processing to “de-noise” help?
1. Generalized Sample and Aggregate#
Design a variant of sample and aggregate which does not require the analyst to specify the output range of the query function \(f\).
Tip
Use SVT to find good upper and lower bounds on \(f(x)\) for the whole dataset first. The result of \(clip(f(x), lower, upper)\) has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds.
2. Summary Statistics#
Design an algorithm to produce differentially private versions of the following statistics:
Mean: \(\mu = \frac{1}{n} \sum_{i=1}^n x_i\)
Variance: \(var = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2\)
Standard deviation: \(\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}\)
Ideas:
Mean
Use SVT to find upper and lower clipping bounds
Compute noisy sum and count, and derive mean by post-processing
Variance
Split it into a count query (\(\frac{1}{n}\) - we have the answer from above) and a sum query
What’s the sensitivity of \(\sum_{i=1}^n (x_i - \mu)^2\)? It’s \(b^2\); we can clip and compute \(\sum_{i=1}^n (x_i - \mu)^2\), then multiply by (1) by post processing
Standard Deviation
Just take the square root of variance
Total queries:
Lower clipping bound (SVT)
Upper clipping bound (SVT)
Noisy sum (mean)
Noisy count
Noisy sum (variance)
3. Heavy Hitters#
Google’s RAPPOR system [18] is designed to find the most popular settings for Chrome’s home page. Design an algorithm which:
Given a list of the 10,000 most popular web pages by traffic,
Determines the top 10 most-popular home pages out of the 10,000 most popular web pages
Tip
Use parallel composition and take the noisy top 10
4. Hierarchical Queries#
Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:
Census tract
City / town
ZIP Code
County
State
USA
Tip
Idea 1: Only compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.
Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.
Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0.
5. Workloads of Range Queries#
Design an algorithm to accurately answer a workload of range queries. Range queries are queries on a single table of the form: “how many rows have a value that is between \(a\) and \(b\)?” (i.e. the count of rows which lie in a specific range).
Part 1#
The whole workload is pre-specified as a finite sequence of ranges: \(\{(a_1, b_1), \dots, (a_k, b_k)\}\), and
Part 2#
The length of the workload \(k\) is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.
Part 3#
The workload may be infinite.
Ideas:
Just run each query with sequential composition.
For part 1, combine them so we can use \(L2\) sensitivity. When \(k\) is large, this will work well with Gaussian noise.
Or, build synthetic data:
For each range \((i, i+1)\), find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.
For part 2, use SVT
For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it’s far, query the real answer’s range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you ONLY pay for updates to the synthetic data.
Checklist for Privacy Deployment#
Deployment of differential privacy involves several steps to ensure the protection of sensitive data while still allowing useful analysis. Here’s a checklist for deploying a system with differential privacy:
Data Classification: Identify and classify sensitive data that needs protection. Understand the sensitivity levels and potential risks associated with each class of data.
Data Preparation: Preprocess the data to remove personally identifiable information (PII) and other sensitive attributes that are not relevant to the analysis.
Define Privacy Budget: Determine acceptable levels of utility or privacy risk for your deployment. This will govern the amount of noise added to the data to achieve differential privacy.
Data Analysis Requirements: Clearly outline the data analysis goals, as well as specific transformations and queries that need to be supported while maintaining privacy. Ensure that the chosen privacy mechanisms and privacy unit can accommodate these requirements.
Implement Differential Privacy: Choose appropriate differential privacy mechanisms such as Laplace mechanism, Gaussian mechanism, or other advanced techniques based on the analysis requirements. Implement these mechanisms into the data processing pipeline. Identify a strategy for privacy composition if applicable.
Noise Generation: Generate and introduce high-quality entropy to the data in accordance with the chosen differential privacy mechanism. Ensure that the randomness level is calibrated to achieve the desired privacy guarantee.
Testing and Verification: Conduct thorough testing to validate the effectiveness of the deployed differential privacy mechanisms. Test the system with a variety of queries and scenarios to ensure that privacy is preserved while maintaining data utility. Ideally, conduct tests on public or synthetic data.
Performance Evaluation: Evaluate the performance overhead introduced by the time and storage cost of differential privacy mechanisms. Monitor system metrics such as latency, throughput, and compute resource utilization to ensure acceptable levels of efficiency.
Documentation and Compliance: Document the deployment process, including the differential privacy mechanisms used, privacy parameters chosen, and any other relevant details. Ensure compliance with relevant privacy laws, regulations and standards.
Additional Security Measures: Implement all necessary security measures to protect against potential attacks or vulnerabilities. This may include encryption of data in transit and at rest, access controls, and auditing mechanisms. This may also include safeguards against side channel attacks such as isolated compute environments.
User Education and Training: Educate users and stakeholders about the principles of differential privacy, its implications, and the importance of preserving privacy while conducting data analysis.
Continuous Monitoring and Maintenance: Establish a process for ongoing monitoring and maintenance of the deployed system. Regularly review privacy parameters and update mechanisms as needed to adapt to evolving privacy requirements and threats.
By following this checklist, you can effectively deploy a system with differential privacy to protect sensitive data while enabling valuable analysis and insights!