# Exercises in Algorithm Design

## Contents

# Exercises in Algorithm Design#

## Issues to Consider#

How many queries are required, and what kind of composition can we use?

Is parallel composition possible?

Should we use sequential composition, advanced composition, or a variant of differential privacy?

Can we use the sparse vector technique?

Can we use the exponential mechanism?

How should we distribute the privacy budget?

If there are unbounded sensitivities, how can we bound them?

Would synthetic data help?

Would post-processing to “de-noise” help?

## 1. Generalized Sample and Aggregate#

Design a variant of sample and aggregate which does *not* require the analyst to specify the output range of the query function \(f\).

**Ideas**: use SVT to find good upper and lower bounds on \(f(x)\) for the whole dataset first. The result of \(clip(f(x), lower, upper)\) has bounded sensitivity, so we can use this query with SVT. Then use sample and aggregate with these upper and lower bounds.

## 2. Summary Statistics#

Design an algorithm to produce differentially private versions of the following statistics:

Mean: \(\mu = \frac{1}{n} \sum_{i=1}^n x_i\)

Variance: \(var = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2\)

Standard deviation: \(\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}\)

**Ideas**:

**Mean**

Use SVT to find upper and lower clipping bounds

Compute noisy sum and count, and derive mean by post-processing

**Variance**

Split it into a count query (\(\frac{1}{n}\) - we have the answer from above) and a sum query

What’s the sensitivity of \(\sum_{i=1}^n (x_i - \mu)^2\)? It’s \(b^2\); we can clip and compute \(\sum_{i=1}^n (x_i - \mu)^2\), then multiply by (1) by post processing

**Standard Deviation**

Just take the square root of variance

Total queries:

Lower clipping bound (SVT)

Upper clipping bound (SVT)

Noisy sum (mean)

Noisy count

Noisy sum (variance)

## 3. Heavy Hitters#

Google’s RAPPOR system [15] is designed to find the most popular settings for Chrome’s home page. Design an algorithm which:

Given a list of the 10,000 most popular web pages by traffic,

Determines the top 10 most-popular home pages out of the 10,000 most popular web pages

**Ideas**: Use parallel composition and take the noisy top 10

## 4. Hierarchical Queries#

Design an algorithm to produce summary statistics for the U.S. Census. Your algorithm should produce total population counts at the following levels:

Census tract

City / town

ZIP Code

County

State

USA

**Ideas**:

Idea 1: *Only* compute the bottom level (census tract), using parallel composition. Add up all the tract counts to get the city counts, and so on up the hierarchy. Advantage: lowers privacy budget.

Idea 2: Compute counts at all levels, using parallel composition for each level. Tune the budget split using real data; probably we need more accuracy for the smaller levels of the hierarchy.

Idea 3: As (2), but also use post-processing to re-scale lower levels of the hierarchy based on higher ones; truncate counts to whole numbers; move negative counts to 0.

## 5. Workloads of Range Queries#

Design an algorithm to accurately answer a workload of *range queries*. Range queries are queries on a single table of the form: “how many rows have a value that is between \(a\) and \(b\)?” (i.e. the count of rows which lie in a specific range).

### Part 1#

The whole workload is pre-specified as a finite sequence of ranges: \(\{(a_1, b_1), \dots, (a_k, b_k)\}\), and

### Part 2#

The length of the workload \(k\) is pre-specified, but queries arrive in a streaming fashion and must be answered as they arrive.

### Part 3#

The workload may be infinite.

**Ideas**:

Just run each query with sequential composition.

For part 1, combine them so we can use \(L2\) sensitivity. When \(k\) is large, this will work well with Gaussian noise.

Or, build synthetic data:

For each range \((i, i+1)\), find a count (parallel composition). This is a synthetic data representation! We can answer infinitely many queries by adding up the counts of all the segments in this histogram which are contained in the desired interval.

For part 2, use SVT

For SVT: for each query in the stream, ask how far the real answer is from the synthetic data answer. If it’s far, query the real answer’s range (as a histogram, using parallel composition) and update the synthetic data. Otherwise just give the synthetic data answer. This way you *ONLY* pay for updates to the synthetic data.