Friday, October 19, 2007

What is Privacy?

Today there was a joint CMU-Microsoft Research workshop on privacy, and it opened with a discussion of various definitions of privacy. 'Privacy' is a word we might use every day, but what do we mean by it? Suppose you had some sensitive piece of information in a large database. What sorts of information about the database would you feel comfortable having released to the public, without it counting as a 'privacy' violation?

Cynthia Dwork began with an interesting toy example: Suppose you want to keep your 'height' private, and there exists a database of heights. It seems innocuous enough to release the average height of Norwegian women. But suppose your worst enemy already knows that you are two inches taller than the average Norwegian woman -- then the release of average heights has 'violated your privacy'.
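
To make the arithmetic of that attack concrete, here is a minimal sketch (the numbers are purely illustrative, not from the workshop example):

```python
# Hypothetical numbers illustrating the auxiliary-information attack.
released_average = 66.0   # published statistic: average height (in inches) of Norwegian women
known_offset = 2.0        # the enemy's side knowledge: you are 2 inches taller than that average

# Combining the harmless-looking aggregate with side information pins down
# your exact height -- whether or not your record was even in the database.
inferred_height = released_average + known_offset
print(f"Inferred height: {inferred_height} inches")  # 68.0
```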



But has it really? In this case, the 'violation of your privacy' could have occurred whether or not your height was included in the database. This suggests an elegant definition of differential privacy that Cynthia and others have been working with: if the statistics released from the database would be distributed (almost) identically whether or not you were in the database, your privacy has not been violated. Formally, we want that for any two neighboring databases D and D' (differing in only one element), for any statistic Q(D) that might be released, and for any value x of that statistic:



(1 - epsilon) * Pr[Q(D') = x] < Pr[Q(D) = x] < (1 + epsilon) * Pr[Q(D') = x]



where the probability is taken only over the internal randomness of the information-release mechanism (no assumptions are made about the distribution of the database elements). What this definition implies is that if you are considering whether or not to include your information in the database, you needn't fear that your decision will compromise your privacy -- anything that someone could learn from the database with your information in it, they could (essentially) also learn without it.
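
As a concrete illustration -- a sketch only, since the post doesn't commit to any particular mechanism -- one standard way to release a statistic with a guarantee of this multiplicative form is to add Laplace noise calibrated to how much one person's record can shift the answer (this gives bounds of the form e^epsilon, which is roughly 1 + epsilon for small epsilon). The database values, epsilon, and the height range below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_average(heights, epsilon, height_range=(50.0, 80.0)):
    """Release the average height with Laplace noise.

    Swapping one person's record can move the average by at most
    (max - min) / n, so Laplace noise with scale sensitivity / epsilon
    is the usual calibration.
    """
    n = len(heights)
    sensitivity = (height_range[1] - height_range[0]) / n
    return float(np.mean(heights) + rng.laplace(loc=0.0, scale=sensitivity / epsilon))

# Two neighboring databases: D' differs from D in a single element.
D = np.array([64.0, 66.0, 67.0, 65.0, 68.0])
D_prime = D.copy()
D_prime[0] = 70.0

epsilon = 0.5
samples_D  = [noisy_average(D, epsilon) for _ in range(100_000)]
samples_Dp = [noisy_average(D_prime, epsilon) for _ in range(100_000)]

# Empirically, the chance of the released value landing in any fixed interval
# is nearly the same for D and D', up to a small multiplicative factor.
in_bucket = lambda samples: np.mean([66.0 <= x < 66.5 for x in samples])
print(in_bucket(samples_D), in_bucket(samples_Dp))
```

Shrinking epsilon drives those two empirical frequencies closer together, at the cost of noisier releases.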



What is that epsilon parameter anyhow? In crypto, you would generally take epsilon to be something like 1/2^n -- something so tiny that the two distributions are essentially identical. But you can't achieve that here if you want the database to be at all useful. Since we can get from any database to any other through a path of n neighboring databases (just swap out elements of the original database, one at a time, for elements of the target database), output probabilities differ between any two databases by a multiplicative factor of at most roughly (1 + epsilon)^n. So if we want to be able to meaningfully distinguish any two databases, we need epsilon to be something like 1/n.
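
A quick back-of-the-envelope check (with made-up numbers) shows why epsilon on the order of 1/n is the right scale -- the multiplicative factors compound along the chain of n swaps:

```python
import math

n = 10_000  # hypothetical database size

for epsilon in (1 / 2**40, 1.0 / n, 0.01):
    # Across n single-element swaps, output probabilities can drift by a
    # factor of up to roughly (1 + epsilon)^n, about e^(n * epsilon).
    factor = math.exp(n * epsilon)
    print(f"epsilon = {epsilon:.1e}  ->  factor of up to ~{factor:.3g}")

# A cryptographically tiny epsilon (like 2^-40) forces even wildly different
# databases to produce nearly identical outputs, so the answers are useless.
# epsilon ~ 1/n keeps the end-to-end factor around e: far-apart databases
# stay distinguishable while neighboring ones remain nearly indistinguishable.
```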

So this seems like a pretty good definition -- it cleanly gets to the heart of the matter, and it is parametrized to let you trade off between privacy and usefulness (and it makes explicit that there must necessarily be such a trade-off...)