Friday, October 19, 2007

What is Privacy?

Today there was a joint CMU-Microsoft Research workshop on privacy, and it opened with a discussion of various definitions of privacy. 'Privacy' is a word we might use every day, but what do we mean by it? Suppose you had some sensitive piece of information in a large database. What sorts of information about the database would you feel comfortable having released to the public without a 'privacy' violation?

Cynthia Dwork began with an interesting toy example: Suppose you want to keep your 'height' private, and there exists a database of heights. It seems innocuous enough to release the average height of Norwegian women. But suppose your worst enemy already knows that you are two inches taller than the average Norwegian woman -- then the release of average heights has 'violated your privacy'.

But has it really? In this case, the 'violation of your privacy' could have occurred whether or not your height was included in the database at all. This suggests an elegant definition of differential privacy that Cynthia and others have been working with: if the statistics released from the database would have (nearly) the same distribution whether or not your record were included, your privacy has not been violated. Formally, we want that for any two neighboring databases D and D' (differing in only one element), for any statistic Q that might be released, and for any value x that the statistic might take:

exp(-epsilon) * Pr[Q(D') = x] <= Pr[Q(D) = x] <= exp(epsilon) * Pr[Q(D') = x]

where the probability is taken only over the internal randomness of the information-release mechanism (no assumptions are made about the distribution of the database elements). For small epsilon, exp(epsilon) is roughly 1 + epsilon, so this says the two probabilities can differ by only about a (1 +/- epsilon) factor. What this definition implies is that if you are considering whether or not to include your information in the database, you shouldn't fear that your decision will help compromise your privacy -- anything that someone could learn from the database with your information in it, they could also learn without your information.
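
The definition doesn't say how to actually release a useful statistic under this constraint. One standard way to satisfy it is the Laplace mechanism: add noise drawn from a Laplace distribution whose scale is the query's sensitivity divided by epsilon. A rough Python sketch for the average-height example, with made-up numbers:

import math
import random

def laplace_noise(scale):
    # Sample from a Laplace(0, scale) distribution via the inverse CDF.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_average(heights, lo, hi, epsilon):
    # If every height lies in [lo, hi], swapping one record changes the average
    # of n values by at most (hi - lo) / n; that bound is the query's
    # 'sensitivity', and the noise scale is sensitivity / epsilon.
    n = len(heights)
    sensitivity = (hi - lo) / n
    return sum(heights) / n + laplace_noise(sensitivity / epsilon)

# Made-up example: heights in cm, epsilon = 0.1.
heights = [162.0, 170.5, 158.2, 175.0, 167.3]
print(private_average(heights, lo=140.0, hi=200.0, epsilon=0.1))

The smaller epsilon gets, the larger the noise, which is exactly the privacy-versus-usefulness trade-off discussed below.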

What is that epsilon parameter, anyhow? In crypto, you would generally take epsilon to be something like 1/2^n -- something so tiny that the two distributions are essentially identical. But you can't achieve that here if you want the database to be at all useful. We can get from any database to any other of the same size through a path of at most n neighboring databases (just swap out elements of the original database for elements of the target database, one at a time), so the probability of any query answer differs between any two databases by a factor of at most exp(n*epsilon). If we want to be able to meaningfully distinguish any two databases -- that is, if the released statistics are to carry any information at all -- epsilon needs to be on the order of 1/n rather than exponentially small.
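
To put rough (and entirely arbitrary) numbers on that scaling argument:

import math

n = 1000            # arbitrary database size, just for illustration

# Any two size-n databases are connected by at most n single-element swaps, and
# each swap changes output probabilities by a factor of at most exp(epsilon),
# so any two databases differ by a factor of at most exp(n * epsilon).

epsilon = 1.0 / n
print(math.exp(n * epsilon))         # exp(1) ~ 2.72: databases stay distinguishable

crypto_epsilon = 2.0 ** -n           # a crypto-style, exponentially small epsilon
print(math.exp(n * crypto_epsilon))  # ~1.0: every database looks the same, so answers are useless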

So this seems like a pretty good definition -- it cleanly gets to the heart of the matter, and it is parametrized so that you can cleanly trade off privacy against usefulness (and it makes explicit that there must necessarily be such a trade-off...)

3 comments:

Anonymous said...

Thanks for the interesting post.

tentaizu said...

What I want to know is: how often is my privacy compromised because someone computed a statistic on a database, compared to how often it is compromised because the database was leaked (in full or in large part) through a lost laptop, a malicious intruder, or a company selling my information for profit?

While this may become more of an issue at some point, when I am deciding whether or not to include my information in a database I am going to be more concerned about a company's policies and my level of trust in its corporate security and technology practices, and much less about its differential database privacy.

Aaron said...

Tentaizu:

You make a good point -- differential privacy is a definition that helps to understand the privacy of our information when it is in the hands of a trusted entity, who is following 'the rules'. As a practical matter, we should probably be more worried about privacy violations at the hands of those who are not following the rules. But this is a hard problem -- if we believe that databases of sometimes sensitive information are useful things (and I think most people would agree that they are), then there will always be a risk of that information falling into the hands of evil-doers.

But even if our data hasn't fallen into the hands of the bad guys, we'd like to make sure that the good guys don't accidentally release our private information, and this is what differential privacy is about. For example:

Check out this paper: http://www.cs.utexas.edu/~shmat/shmat_netflix-prelim.pdf -- the authors de-anonymize the Netflix Challenge database by cross-referencing movie ratings with the public IMDB database. It's easy to imagine that some of your movie-watching preferences are non-sensitive (you enjoy foreign indie films), and these you post about publicly on IMDB. Other movie preferences might be a bit more sensitive, and these you only rate on Netflix, imagining that they will be kept private. When Netflix releases an 'anonymized' dataset including your information, your privacy has been violated if Netflix did not properly understand what it was doing and your sensitive information can be recovered. Other examples include AOL's famously released 'anonymized' search histories (from which the NY Times was able to identify specific individuals' searches), and the federal HIPAA guidelines, which allow medical records to be released so long as 'unique identifiers' such as names, social security numbers, etc. are removed.
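
As a toy illustration of that kind of cross-referencing (this is not the algorithm from the paper -- the record format, matching rule, and threshold are invented for the example), a handful of matching (movie, rating) pairs is enough to tie an 'anonymous' record back to a public profile:

def link_records(anonymized_records, public_profiles, min_overlap=3):
    # anonymized_records: list of dicts {movie: rating} with identities stripped.
    # public_profiles: dict {person_name: {movie: rating}} taken from a public site.
    matches = {}
    for i, record in enumerate(anonymized_records):
        for name, public_ratings in public_profiles.items():
            shared = [m for m, r in record.items() if public_ratings.get(m) == r]
            if len(shared) >= min_overlap:
                # Linking the record to a name exposes every rating in it that
                # was never posted publicly.
                matches[i] = name
    return matches

The real attack is far more robust than this sketch -- the paper shows the linkage still works with approximate dates and even a few incorrect ratings -- but the basic idea is the same.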

It's good for us to be aware of companies' privacy policies, but we have to understand what privacy is in order to make and understand sensible policies. I'd be interested to know whether the privacy violations committed by Netflix and AOL violated their official policies, and if so, whether those policies were previously considered reasonable.