Tuesday, November 27, 2007

Useful Privacy


A little while ago I wrote about Dwork et al.'s differential privacy as a good definition for characterizing what exactly we intend to convey by the word 'privacy'. Databases of 'private' information are very valuable sources of information (for example, we might want to be able to learn that smoking correlates to lung cancer, despite the fact that people want to keep their health records private). Therefore, we would like a definition of privacy that allows us to learn such facts about the distribution (that smoking correlates to lung cancer) but does not allow us to learn particulars about database entries (that Alice has lung cancer). epsilon-Differential privacy approaches this by guaranteeing that two databases that differ in only a single element yield answer-distributions that differ nowhere by more than exp(epsilon) when accessed by a privacy-preserving mechanism. But why not address the issue directly?

A database-access mechanism "A" preserves (epsilon,delta)-distributional privacy if for every distribution D, and any two databases D1, D2 with entries drawn from D, answer-distributions of A on D1 and D2 differ nowhere by more than exp(epsilon), except with probability delta. That is, if we imagine that databases (say, medical records of particular individuals) are drawn from some distribution (the set of all individuals), then we should only be able to learn things about the distribution, not about the individuals: We should get the same answers to any questions we might ask, from any database drawn from that distribution.

This is the privacy definition proposed in a recent paper by Avrim Blum, Katrina Ligett, and me. So how does it relate to differential privacy? On the one hand, it seems like we are making a stronger guarantee: Differential privacy guarantees close answer distributions only when databases differ on few elements, while our databases D1 and D2 could differ in all of their elements. On the other hand, it seems like we are claiming something weaker: we make the assumption that our databases are drawn from a distribution, and we allow privacy violations with some probability delta (which we generally think of as exponentially small). In fact, it turns out that distributional privacy is a strictly stronger guarantee than differential privacy -- with appropriate parameters, all distributionally-private mechanisms are also differentially-private, but not vice versa.


Of course, a stronger definition of privacy is no good if we can't do anything useful with it -- after all, we could insist that all database access mechanisms simply ignored the database, and answered questions randomly, but such a notion of privacy would not be useful. What do we mean by useful anyway?

Past work on privacy preserving data release doesn't seem to have addressed this, and has instead just tried to answer all queries with answers that are as close to the true answer as possible. But from learning theory, we get natural definitions of usefulness: Given a database D, we could say that an output database D' is (epsilon,delta)-useful with respect to a class of queries C, if with probability 1-delta, for every query q in C, |q(D) - q(D')| < epsilon.

By clarifying what we want to achieve by 'useful' data release, we make previous lower bounds on data release seem less scary. Results by Dinur and Nissim and Dwork et al. make release of static databases seem inconsistent with the goals of privacy, because they show it is possible to reconstruct almost the entire original database with a small number of almost-correct "subset sum" queries. But once we clarify what we mean by 'usefulness', its clear that we might still hope to statically release databases that are still 'useful' for a class of queries that does not include subset-sum queries, while still preserving privacy.

Well, maybe. In the paper, we consider the class of halfspace queries (a generalization to higher dimensions of queries of the type "what fraction of Americans are taller than 6'2''?"). We can show that its impossible to release a database useful for halfspace queries while preserving even differential-privacy. However, if we are willing to relax our definition of usefulness (With high probability, for every query q, |q(D') - q'(D)| < epsilon, for some query q' that is "close to" q), we can release a database "useful" for halfspace queries that preserves the stronger notion of distributional privacy.

So what is the take home message? Useful and private data release is still possible (despite strong impossibility results!) if we are willing to relax our definitions. We cannot hope to be simultaniously useful for all queries, but perhaps we can still be "useful" to many.

Sunday, November 11, 2007

That new facebook box

From the New York Times today:
Millions of people in this country -- particularly young people -- already have surrendered anonymity to social networking sites such as MySpace and Facebook, and to Internet commerce. These sites reveal to the public, government and corporations what was once closely guarded information, like personal statistics and credit card numbers.

After I put up my cell-phone number on my facebook profile, filling in the extra fields about my credit card number and where I hide my spare key seemed only natural. But I may take them down after reading this article.

Sunday, November 04, 2007

Smoothed Analysis of the Price of Anarchy

Given some system in which a lot of selfish agents are interacting (a game), we would like to predict what might happen. If we have some way of measuring the quality of an outcome, we might want to know how much worse an outcome the selfish players might light upon compared to the best solution possible. But what will the selfish players do? Typically, people assume that they will play according to a Nash equilibrium. This is a solution concept with lots of problems for computationally limited agents, that I have written about here. Lets just assume for a post that players do fall into Nash equilibrium (This isn't crazy in certain games (like potential games) in which natural iterative better-response play leads to a Nash equilibrium eventually.) Which Nash equilibrium will the agents end up in?

Typically we skirt this issue by defining the "price of anarchy" to be the ratio of the value of the worst equilibrium to the value of the optimal solution -- a pessimistic bound. Will we really end up in the worst Nash equilibrium?

A related problem comes up in the design of algorithms. How should we measure the running time of an algorithm? The worst case running time? Some notion of average case? Average case running time is misleading because our notion of 'average' probably won't coincide with the data's. But worst case running time can also be misleading if all worst-case inputs are somehow contrived and 'brittle' to small changes. Smoothed analysis is a compromise -- the smoothed complexity of an algorithm is measured by taking a worst case input, but then measuring the running time of the algorithm with respect to small perturbations of that worst-case input. The idea is, if the bad inputs are extremely brittle to perturbation, we are unlikely to actually encounter bad inputs.

So maybe we can do a smoothed analysis of the price of anarchy. In a recent paper, Christine Chung, Katrina Ligett, Kirk Pruhs, and I propose studying the stochastically stable states of a game -- the states that computationally simple agents, who make mistakes (noise) in their play with some small probability each turn. In general, the solution concept that this leads to can be completely different from the Nash equilibria of the game, but in potential games, it turns out that the stochastically stable states are a subset of the Nash equilibria. What does this mean? The stochastically stable states in these games are the Nash equilibria that are resilient to these occasional mistakes in play, and the price of anarchy taken over the worst case stochastically stable states ("the price of stochastic anarchy") can be viewed as a smoothed analysis of the price of anarchy -- The worst we are likely to encounter in a noisy, imperfect world.

Here's an example: Suppose Aaron and Bob have programs that they want to run on a computer, and there are two computers, X and Y. Aaron's program costs epsilon to run on X, and 1 to run on Y. Bob's program costs epsilon to run on Y, and 1 to run on X. Both players can choose what machine that they will run their programs on, and will pay the sum of the costs on the machine they picked. For example, if both Aaron and Bob pick machine X, they will both pay 1 + epsilon.

The optimal solution is obvious: Aaron plays on X, Bob plays on Y, and both pay epsilon. This is also a Nash equilibrium -- no player can unilaterally improve his payoff by switching machines (Bob could switch machines, but that would increase his cost from epsilon to 1 + epsilon).

If instead Aaron plays on Y and Bob plays on X, both players now pay 1, and this is again a Nash equilibrium, although a much worse one -- either player could switch machines, but because of congestion, would increase their cost from 1 to 1+epsilon.

So the price of anarchy of this game is 1/epsilon, which can be arbitrarily large. But is the bad equilibrium really realistic? If Aaron is playing on Y every turn, and Bob is playing on X, if Aaron makes a mistake one turn (and plays on X), then Bob can take advantage of this and increase his payoff by moving to X. The optimal equilibrium on the other hand is resilient to mistakes like this -- both players are already paying the smallest cost that they could ever pay in this game. It turns out that although the price of anarchy in the 2 player "load balancing" game that I just described is unbounded, but the price of stochastic anarchy is just 2. So the price of anarchy gave us an unrealistic measure of how bad things could get, because of the existence of bad, but brittle equilibria.

In our paper, we consider the general case of the n player m machine load balancing game, and prove that the price of stochastic anarchy is bounded. What other games could benefit from a 'smoothed' analysis of the price of anarchy?

UPDATE: Our paper is now available here.