Tuesday, November 27, 2007

Useful Privacy


A little while ago I wrote about Dwork et al.'s differential privacy as a good definition for characterizing what exactly we intend to convey by the word 'privacy'. Databases of 'private' information are very valuable sources of information (for example, we might want to be able to learn that smoking correlates to lung cancer, despite the fact that people want to keep their health records private). Therefore, we would like a definition of privacy that allows us to learn such facts about the distribution (that smoking correlates to lung cancer) but does not allow us to learn particulars about database entries (that Alice has lung cancer). epsilon-Differential privacy approaches this by guaranteeing that when a privacy-preserving mechanism is run on two databases that differ in only a single element, the resulting answer distributions are pointwise within a multiplicative factor of exp(epsilon) of each other. But why not address the issue directly?
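As a concrete illustration (mine, not from the post): the standard way to achieve epsilon-differential privacy for a low-sensitivity query is the Laplace mechanism of Dwork et al., which perturbs the true answer with noise scaled to the query's sensitivity. The function and database names below are made up for the sketch.

```python
import math
import random

def count_query(db, predicate):
    """Exact count of rows satisfying predicate; changing one row
    changes the answer by at most 1 (sensitivity 1)."""
    return sum(1 for row in db if predicate(row))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Release true_answer + Laplace(sensitivity/epsilon) noise.
    Standard result: this gives epsilon-differential privacy."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    # inverse-CDF sample from a Laplace(0, scale) distribution
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_answer + noise

# toy database of (smoker, has_lung_cancer) records
db = [(True, True), (True, False), (False, False), (True, True)]
noisy = laplace_mechanism(count_query(db, lambda r: r[0] and r[1]), 1.0, 0.5)
```

The noisy count is accurate in expectation, yet any single record's presence changes the output distribution by at most a factor of exp(epsilon).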

A database-access mechanism A preserves (epsilon,delta)-distributional privacy if, for every distribution P and any two databases D1, D2 with entries drawn from P, the answer distributions of A on D1 and on D2 are pointwise within a multiplicative factor of exp(epsilon) of each other, except with probability delta over the draw of the databases. That is, if we imagine that databases (say, medical records of particular individuals) are drawn from some underlying distribution (the set of all individuals), then we should only be able to learn things about the distribution, not about the individuals: we should get the same answers to any questions we might ask, from any database drawn from that distribution.
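One way to write the quantifier structure of this definition down (my rendering, with \mathcal{P} for the underlying distribution to keep it distinct from the databases; the paper's exact formulation may differ in details such as how the n entries are drawn):

```latex
% A is (epsilon,delta)-distributionally private if, for every distribution
% \mathcal{P} and databases D_1, D_2 of n entries drawn from \mathcal{P}:
\Pr_{D_1, D_2 \sim \mathcal{P}^n}\Big[\,\forall\, S:\;
  \Pr[A(D_1) \in S] \le e^{\epsilon} \cdot \Pr[A(D_2) \in S]\,\Big]
  \;\ge\; 1 - \delta
```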

This is the privacy definition proposed in a recent paper by Avrim Blum, Katrina Ligett, and me. So how does it relate to differential privacy? On the one hand, it seems like we are making a stronger guarantee: Differential privacy guarantees close answer distributions only when databases differ on few elements, while our databases D1 and D2 could differ in all of their elements. On the other hand, it seems like we are claiming something weaker: we make the assumption that our databases are drawn from a distribution, and we allow privacy violations with some probability delta (which we generally think of as exponentially small). In fact, it turns out that distributional privacy is a strictly stronger guarantee than differential privacy -- with appropriate parameters, all distributionally-private mechanisms are also differentially-private, but not vice versa.


Of course, a stronger definition of privacy is no good if we can't do anything useful with it -- after all, we could insist that all database-access mechanisms simply ignore the database and answer questions randomly, but such a notion of privacy would not be useful. What do we mean by 'useful' anyway?

Past work on privacy-preserving data release doesn't seem to have addressed this, and has instead just tried to answer all queries with answers that are as close to the true answers as possible. But from learning theory, we get natural definitions of usefulness: given a database D, we could say that an output database D' is (epsilon,delta)-useful with respect to a class of queries C if, with probability 1-delta, for every query q in C, |q(D) - q(D')| < epsilon.
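In code, the epsilon-accuracy part of this definition (for a fixed pair of databases and a finite query class) is just a worst-case check over the class; the names here are mine:

```python
def max_query_error(D, D_prime, queries):
    """Worst-case disagreement between D and D' over a finite query class C."""
    return max(abs(q(D) - q(D_prime)) for q in queries)

def is_useful(D, D_prime, queries, epsilon):
    """Accuracy half of (epsilon,delta)-usefulness: every query in the
    class answers nearly the same on the released D' as on the true D.
    (The delta is over the mechanism's randomness, not checked here.)"""
    return max_query_error(D, D_prime, queries) < epsilon

# toy example: fractional count queries on 0/1 data
D = [0, 1, 1, 0, 1]
D_prime = [1, 1, 0, 1, 1]
queries = [lambda db: sum(db) / len(db),
           lambda db: 1 - sum(db) / len(db)]
```

Here the released database disagrees with the original on individual rows, yet answers both fractional queries to within 0.2, so it is "useful" for this class at epsilon = 0.25.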

By clarifying what we want to achieve by 'useful' data release, we make previous lower bounds on data release seem less scary. Results by Dinur and Nissim and Dwork et al. make release of static databases seem inconsistent with the goals of privacy, because they show it is possible to reconstruct almost the entire original database from a small number of almost-correct "subset sum" queries. But once we clarify what we mean by 'usefulness', it's clear that we might still hope to statically release databases that are 'useful' for a class of queries that does not include subset-sum queries, while still preserving privacy.
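The actual Dinur-Nissim reconstruction uses linear programming against noisy answers, but the underlying danger is easy to see in the exact case: subset-sum queries on singleton sets simply read the database off bit by bit. A toy illustration (mine, not their algorithm):

```python
def subset_sum(db, S):
    """Subset-sum query: how many of the rows indexed by S are 1?"""
    return sum(db[i] for i in S)

secret_db = [1, 0, 1, 1, 0, 1]
# an adversary with exact answers to singleton queries recovers everything
reconstructed = [subset_sum(secret_db, {i}) for i in range(len(secret_db))]
```

The lower bounds show that even adding a moderate amount of noise to every answer does not save you if the mechanism must answer all subset-sum queries accurately -- hence the appeal of restricting the query class instead.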

Well, maybe. In the paper, we consider the class of halfspace queries (a generalization to higher dimensions of queries of the type "what fraction of Americans are taller than 6'2''?"). We can show that it's impossible to release a database useful for halfspace queries while preserving even differential privacy. However, if we are willing to relax our definition of usefulness (with high probability, for every query q, |q(D') - q'(D)| < epsilon for some query q' that is "close to" q), we can release a database "useful" for halfspace queries that preserves the stronger notion of distributional privacy.
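For concreteness, a halfspace query just asks what fraction of the data points lie on one side of a hyperplane; the height question above is the one-dimensional case. A toy rendering (my names):

```python
def halfspace_query(points, w, t):
    """Fraction of points x with <w, x> > t, i.e. lying on the
    positive side of the hyperplane defined by weights w and threshold t."""
    positive = sum(1 for x in points
                   if sum(wi * xi for wi, xi in zip(w, x)) > t)
    return positive / len(points)

# 1-D case: "what fraction of people are taller than 74 inches (6'2'')?"
heights = [(70.0,), (75.0,), (62.0,), (80.0,)]
frac = halfspace_query(heights, (1.0,), 74.0)  # 0.5
```

The relaxed notion of usefulness allows the released database to answer with respect to a slightly rotated or shifted hyperplane q' instead of q itself.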

So what is the take-home message? Useful and private data release is still possible (despite strong impossibility results!) if we are willing to relax our definitions. We cannot hope to be simultaneously useful for all queries, but perhaps we can still be "useful" for many.
