Tuesday, September 10, 2019

A New Analysis of "Adaptive Data Analysis"

This is a blog post about our new paper, which you can read here: https://arxiv.org/abs/1909.03577 

The most basic statistical estimation task is estimating the expected value of some predicate $q$ over a distribution $\mathcal{P}$: $\mathrm{E}_{x \sim \mathcal{P}}[q(x)]$, which I'll just write as $q(\mathcal{P})$. Think about estimating the mean of some feature in your data, or the error rate of a classifier that you have just trained.  There's a really obvious way to come up with a good estimate if you've got a dataset $S \sim \mathcal{P}^n$ of $n$ points that were sampled i.i.d. from the distribution: just use the empirical mean $a = \frac{1}{n}\sum_{i=1}^n q(S_i)$! In fact, this is a great way to estimate the values of a really large number of predicates, so long as they were chosen non-adaptively: that is, so long as you came up with all of the predicates you wanted to estimate before you estimated any of the answers. This phenomenon is, for example, what classical generalization theorems in machine learning rely on: the empirical error of a set of classifiers in some model class will be a good estimate of their actual, out-of-sample error, so long as your dataset is at least as large as the logarithm of your model class.

But this guarantee breaks down if the predicates that you want to estimate are chosen in sequence, adaptively. For example, suppose you are trying to fit a machine learning model to data. If you train your first model, estimate its error, and then as a result of the estimate tweak your model and estimate its error again, you are engaging in exactly this kind of adaptivity. If you repeat this many times (as you might when you are tuning hyper-parameters in your model) you could quickly get yourself into big trouble. Of course there is a simple way around this problem: just don't re-use your data. The most naive baseline that gives statistically valid answers is called "data splitting". If you want to test k models in sequence, just randomly partition your data into k equal sized parts, and test each model on a fresh part. The holdout method is just the special case of k = 2.  But this "naive" method doesn't make efficient use of data: its data requirements grow linearly with the number of models you want to estimate. 

It turns out its possible to do better by perturbing the empirical means with a little bit of noise before you use them: this is what we (Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth --- DFHPRR) showed back in 2014 in this paper, which helped kick off a small subfield known as "adaptive data analysis". In a nutshell, we proved a "transfer theorem" that says the following: if your statistical estimator is simultaneously differentially private and sample accurate --- meaning that with high probability it provides estimates that are close to the empirical means, then it will also be accurate out of sample. When paired with a simple differentially private mechanism for answering queries --- just perturbing their answers with Gaussian noise --- this gave a significant asymptotic improvement in data efficiency over the naive baseline! You can see how well it does in this figure (click to enlarge it): 
Well... Hmm. In this figure, we are plotting how many adaptively chosen queries can be answered accurately as a function of dataset size. By "accurately" we have arbitrarily chosen to mean: answers that have confidence intervals of width 0.1 and uniform coverage probability 95%. On the x axis, we've plotted the dataset size n, ranging from 100,000 to about 12 million. And we've plotted two methods: the "naive" sample splitting baseline, and using the sophisticated Gaussian perturbation technique, as analyzed by the bound we proved in the "DFHPRR" paper. (Actually --- a numerically optimized variant of that bound!) You can see the problem. Even with a dataset size in the tens of millions, the sophisticated method does substantially worse than the naive method! You can extrapolate from the curve that the DFHPRR bound will eventually beat the naive bound, but it would require a truly enormous dataset. When I try extending the plot out that far my optimizer runs into numeric instability issues. 

There has been improvement since then. In particular, DFHPRR didn't even obtain the best bound asymptotically. It is folklore that differentially private estimates generalize in expectation: the trickier part is to show that they enjoy high probability generalization bounds. This is what we showed in a sub-optimal way in DFHPRR. In 2015, a beautiful paper by Bassily, Nissim, Steinke, Smith, Stemmer, and Ullman (BNSSSU) introduced the amazing "monitor technique" to obtain the asymptotically optimal high probability bound. An upshot of the bound was that the Gaussian mechanism can be used to answer roughly $k = n^2$ queries --- a quadratic improvement over the naive sample splitting mechanism! You can read about this technique in the lecture notes of the adaptive data analysis class Adam Smith and I taught a few years back.  Lets see how it does (click to enlarge):
Substantially better! Now we're plotting n from 100,000 up to about only 1.7 million. At this scale, the DFHPRR bound appears to be constant (at essentially 0), whereas the BNSSSU bound clearly exhibits quadratic behavior. It even beats the baseline --- by a little bit, so long as your dataset has somewhat more than a million entries... I should add that what we're plotting here is again a numerically optimized variant of the BNSSSU bound, not the closed-form version from their paper. So maybe not yet a practical technique. The problem is that the monitor argument --- while brilliant --- seems unavoidably to lead to large constant overhead.  

Which brings us to our new work (this is joint work with Christopher Jung, Katrina Ligett, Seth Neel, Saeed Sharifi-Malvajerdi, and Moshe Shenfeld). We give a brand new proof of the transfer theorem. It is elementary, and in particular, obtains high probability generalization bounds directly from high probability sample-accuracy bounds, avoiding the need for the monitor argument. I think the proof is the most interesting part --- its simple and (I think) illuminating --- but an upshot is that we get substantially better bounds, even though the improvement is just in the constants (the existing BNSSSU bound is known to be asymptotically tight). Here's the plot with our bound included --- the x-axis is the same, but note the substantially scaled-up y axis (click to enlarge): 

Proof Sketch
Ok: on to the proof. Here is the trick. In actuality, the dataset S is first sampled from $\mathcal{P}^n$, and then some data analyst interacts with a differentially private statistical estimator, resulting in some transcript $\pi$ of query answer pairs. But now imagine that after the interaction is complete, S is resampled from $Q_\pi = (\mathcal{P}^n)|\pi$, the posterior distribution on datasets conditioned on $\pi$. If you reflect on this for a moment, you'll notice that this resampling experiment doesn't change the joint distribution on dataset transcript pairs $(S,\pi)$ at all. So if the mechanism promised high probability sample accuracy bounds, it still promises them in this resampling experiment. But lets think about what that means: the mechanism can first commit to some set of answers $a_i$, and promise that with high probability, after S is resampled from $Q_\pi$, $|a_i - \frac{1}{n}\sum_{j=1}^n q_i(S_j)|$ is small. But under the resampling experiment, it is quite likely that the empirical value of the query $\frac{1}{n}\sum_{j=1}^n q_i(S_j)$ will end up being close to its expectation over the posterior: $q_i(Q_{\pi}) = \mathrm{E}_{S \sim Q_{\pi}}[\frac{1}{n}\sum_{j=1}^n q_i(S_j)]$. So the only way that a mechanism can promise high probability sample accuracy is if it actually promises high probability posterior accuracy: i.e. with high probability, for every query $q_i$ that was asked and answered, we must have that $|a_i - q_i(Q_\pi)|$ is small.

That part of the argument was generic --- it didn't use differential privacy at all! But it serves to focus attention on these posterior distributions $Q_\pi$ that our statistical estimator induces. And it turns out its not hard to see that the expected value of queries on posteriors induced by differentially private mechanisms have to be close to their true answers. For $(\epsilon,0)$-differential privacy, it follows almost immediately from the definition. Here is the derivation. Pick your favorite query $q$ and your favorite transcript $\pi$, and write $S_j \sim S$ to denote a uniformly randomly selected element of a dataset $S$:
$$q (Q_\pi) =  \sum_{x} q (x) \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x | \pi]= \sum_{x} q (x) \cdot \frac{\Pr [\pi | S_j = x ] \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x]}{\Pr[\pi]}$$
$$\leq \sum_{x} q (x) \cdot \frac{e^\epsilon \Pr [\pi] \cdot \Pr_{S_j \sim \mathcal{P}} [S_j = x]}{\Pr[\pi]}
= e^\epsilon \cdot q (\mathcal{P})$$

Here, the inequality follows from the definition of differential privacy, which controls the effect that fixing a single element of the dataset to any value $(S_j = x)$ can have on the probability of any transcript: it can increase it multiplicatively by a factor of at most $e^\epsilon$. 

And thats it: So we must have that (with probability 1!), $|q(Q_\pi) - q(\mathcal{P})| \leq e^\epsilon-1 \approx \epsilon$. The transfer theorem then follows from the triangle inequality. We get a high probability bound for free, with no need for any heavy machinery.

The argument is just a little more delicate in the case of $(\epsilon,\delta)$-differential privacy, and can be extended beyond linear queries --- but I think this gives the basic idea. The details are in our new "JLNRSS" paper. Incidentally, once nice thing about having many different proofs of the same theorem is that you can start to see some commonalities. One seems to be: it takes six authors to prove a transfer theorem!