Wednesday, October 30, 2019

FORC: A new conference you should know about.

Here is the CFP: https://responsiblecomputing.org/forc-2020-call-for-paper/

FORC 2020: CALL FOR PAPERS

Symposium on Foundations of Responsible Computing
The Symposium on Foundations of Responsible Computing (FORC) is a forum for mathematical research in computation and society writ large.  The Symposium aims to catalyze the formation of a community supportive of the application of theoretical computer science, statistics, economics and other relevant analytical fields to problems of pressing and anticipated societal concern. 

Important Dates
February 11: Submission Deadline
March 23: Notification to Authors
April 1: Camera Ready Deadline
June 1-3: The conference

Any mathematical work on computation and society is welcomed, including topics that are not yet well-established and topics that will arise in the future. This includes the investigation of definitions, algorithms and lower bounds, trade-offs, and economic incentives in a variety of areas. A small sample of topics follow: formal approaches to privacy, including differential privacy; fairness and discrimination in machine learning; bias in the formation of, and diffusion in, social networks; electoral processes and allocation of elected representatives (including redistricting). 
The inaugural FORC will be held on June 1-3 at the Harvard Center for Mathematical Sciences and Applications (CMSA), and will have its proceedings published by LIPIcs. The program committee will review submissions to ensure a high quality program based on novel, rigorous and significant scientific contributions. Authors of accepted papers will have the option of publishing a 10-page version of their paper in the proceedings, or publishing only a 1-page extended abstract, to facilitate the publication of their work in another venue. 1-page abstracts will appear on the website, but not in the proceedings. The symposium itself will feature a mixture of talks by authors of accepted papers and invited talks.
Authors should upload a PDF of the paper through Easychair: https://easychair.org/conferences/?conf=forc2020. The font size should be at least 11 point and the paper should be formatted in a single column. Beyond these, there are no formatting or length requirements, but reviewers will only be asked to read the first 10 pages of the paper. It is the authors’ responsibility that the main results of the paper and their significance be clearly stated within the first 10 pages. Submissions should include proofs of all central claims, and the committee will put a premium on writing that conveys clearly and in the simplest possible way what the paper is accomplishing.  Authors are free to post their paper on arXiv, etc. Future details will appear on the conference website: https://responsiblecomputing.org/.
Steering Committee
Avrim Blum
Cynthia Dwork
Sampath Kannan
Jon Kleinberg
Shafi Goldwasser
Kobbi Nissim
Toni Pitassi
Omer Reingold
Guy Rothblum
Salvatore Ruggieri
Salil Vadhan
Adrian Weller
Program Committee
Yiling Chen, Harvard
Rachel Cummings, Georgia Tech
Anupam Datta, Carnegie Mellon University
Moritz Hardt, UC Berkeley
Nicole Immorlica, Microsoft Research
Michael Kearns, University of Pennsylvania
Katrina Ligett, Hebrew University
Audra McMillan, Boston University and Northeastern
Aaron Roth, University of Pennsylvania (Chair)
Guy Rothblum, Weizmann Institute
Adam Smith, Boston University
Steven Wu, University of Minnesota
Jonathan Ullman, Northeastern
Jenn Wortman Vaughan, Microsoft Research
Suresh Venkatasubramanian, University of Utah
Nisheeth Vishnoi, Yale
James Zou, Stanford

Tuesday, September 10, 2019

A New Analysis of "Adaptive Data Analysis"

This is a blog post about our new paper, which you can read here: https://arxiv.org/abs/1909.03577 

The most basic statistical estimation task is estimating the expected value of some predicate $q$ over a distribution $\mathcal{P}$: $\mathrm{E}_{x \sim \mathcal{P}}[q(x)]$, which I'll just write as $q(\mathcal{P})$. Think about estimating the mean of some feature in your data, or the error rate of a classifier that you have just trained.  There's a really obvious way to come up with a good estimate if you've got a dataset $S \sim \mathcal{P}^n$ of $n$ points that were sampled i.i.d. from the distribution: just use the empirical mean $a = \frac{1}{n}\sum_{i=1}^n q(S_i)$! In fact, this is a great way to estimate the values of a really large number of predicates, so long as they were chosen non-adaptively: that is, so long as you came up with all of the predicates you wanted to estimate before you estimated any of the answers. This phenomenon is, for example, what classical generalization theorems in machine learning rely on: the empirical error of a set of classifiers in some model class will be a good estimate of their actual, out-of-sample error, so long as your dataset is at least as large as the logarithm of the size of your model class.
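To make the non-adaptive case concrete, here is a minimal simulation (my own toy code, not from any paper): a large batch of predicates is fixed before looking at the data, and the worst error of the empirical means over all of them stays small.

```python
# A minimal simulation: many predicates chosen *before* looking at the data
# are all answered well by their empirical means.
import numpy as np

rng = np.random.default_rng(0)
n, k = 10_000, 1_000

# Distribution P: k independent biased coins; dataset S ~ P^n.
biases = rng.uniform(0.2, 0.8, size=k)
S = (rng.random((n, k)) < biases).astype(float)

# The k non-adaptive predicates q_i(x) = x_i, answered by empirical means.
empirical = S.mean(axis=0)        # a_i = (1/n) * sum_j q_i(S_j)
truth = biases                    # q_i(P)

# The worst error over all k queries is on the order of sqrt(log(k)/n),
# i.e. a couple of percent here.
print("max error:", np.max(np.abs(empirical - truth)))
```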

But this guarantee breaks down if the predicates that you want to estimate are chosen in sequence, adaptively. For example, suppose you are trying to fit a machine learning model to data. If you train your first model, estimate its error, and then as a result of the estimate tweak your model and estimate its error again, you are engaging in exactly this kind of adaptivity. If you repeat this many times (as you might when you are tuning hyper-parameters in your model) you could quickly get yourself into big trouble. Of course there is a simple way around this problem: just don't re-use your data. The most naive baseline that gives statistically valid answers is called "data splitting". If you want to test k models in sequence, just randomly partition your data into k equal-sized parts, and test each model on a fresh part. The holdout method is just the special case of k = 2.  But this "naive" method doesn't make efficient use of data: its data requirements grow linearly with the number of models you want to estimate. 
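Here is what that data-splitting baseline looks like as code (an illustrative sketch of my own, with made-up function names):

```python
# A sketch of the data-splitting baseline: each of up to k adaptively chosen
# queries is answered on its own disjoint chunk of the dataset.
import numpy as np

def make_split_estimator(S, k, seed=0):
    """S: array of shape (n, d); returns answer(q), valid for k queries,
    where q maps an (m, d) array of rows to m values in [0, 1]."""
    chunks = np.array_split(np.random.default_rng(seed).permutation(len(S)), k)
    state = {"t": 0}

    def answer(q):
        if state["t"] >= k:
            raise RuntimeError("query budget of k exhausted")
        part = S[chunks[state["t"]]]
        state["t"] += 1
        return q(part).mean()        # plain empirical mean on fresh data

    return answer

# Each chunk has only n/k points, so per-query error scales like sqrt(k/n):
# answering k queries to accuracy alpha needs n on the order of k / alpha^2.
```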

It turns out it's possible to do better by perturbing the empirical means with a little bit of noise before you use them: this is what we (Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth --- DFHPRR) showed back in 2014 in this paper, which helped kick off a small subfield known as "adaptive data analysis". In a nutshell, we proved a "transfer theorem" that says the following: if your statistical estimator is simultaneously differentially private and sample accurate --- meaning that with high probability it provides estimates that are close to the empirical means --- then it will also be accurate out of sample. When paired with a simple differentially private mechanism for answering queries --- just perturbing their answers with Gaussian noise --- this gave a significant asymptotic improvement in data efficiency over the naive baseline! You can see how well it does in this figure: 
Well... Hmm. In this figure, we are plotting how many adaptively chosen queries can be answered accurately as a function of dataset size. By "accurately" we have arbitrarily chosen to mean: answers that have confidence intervals of width 0.1 and uniform coverage probability 95%. On the x axis, we've plotted the dataset size n, ranging from 100,000 to about 12 million. And we've plotted two methods: the "naive" sample splitting baseline, and using the sophisticated Gaussian perturbation technique, as analyzed by the bound we proved in the "DFHPRR" paper. (Actually --- a numerically optimized variant of that bound!) You can see the problem. Even with a dataset size in the tens of millions, the sophisticated method does substantially worse than the naive method! You can extrapolate from the curve that the DFHPRR bound will eventually beat the naive bound, but it would require a truly enormous dataset. When I try extending the plot out that far my optimizer runs into numeric instability issues. 
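For reference, here is the Gaussian perturbation mechanism itself, stripped to its essentials (a sketch with an unspecified noise scale; choosing and analyzing that scale is exactly what the bounds above are about):

```python
# A bare-bones Gaussian mechanism for answering adaptive queries: reuse the
# whole dataset, but perturb each empirical mean with Gaussian noise.
# (The noise scale sigma is a knob; the bounds discussed in this post are all
# about how to set it and what accuracy it buys you.)
import numpy as np

def make_gaussian_estimator(S, sigma, seed=0):
    rng = np.random.default_rng(seed)

    def answer(q):
        return q(S).mean() + rng.normal(0.0, sigma)   # noisy empirical mean

    return answer

# Very roughly: to answer k adaptive queries to accuracy alpha, one can take
# sigma ~ alpha and n ~ sqrt(k) / alpha^2 --- the quadratic (in k) improvement
# over splitting that the transfer theorems are after.
```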

There has been improvement since then. In particular, DFHPRR didn't even obtain the best bound asymptotically. It is folklore that differentially private estimates generalize in expectation: the trickier part is to show that they enjoy high probability generalization bounds. This is what we showed, in a sub-optimal way, in DFHPRR. In 2015, a beautiful paper by Bassily, Nissim, Smith, Steinke, Stemmer, and Ullman (BNSSSU) introduced the amazing "monitor technique" to obtain the asymptotically optimal high probability bound. An upshot of the bound was that the Gaussian mechanism can be used to answer roughly $k = n^2$ queries --- a quadratic improvement over the naive sample splitting mechanism! You can read about this technique in the lecture notes of the adaptive data analysis class Adam Smith and I taught a few years back.  Let's see how it does:
Substantially better! Now we're plotting n from 100,000 up to only about 1.7 million. At this scale, the DFHPRR bound appears to be constant (at essentially 0), whereas the BNSSSU bound clearly exhibits quadratic behavior. It even beats the baseline --- by a little bit, so long as your dataset has somewhat more than a million entries... I should add that what we're plotting here is again a numerically optimized variant of the BNSSSU bound, not the closed-form version from their paper. So it's maybe not yet a practical technique. The problem is that the monitor argument --- while brilliant --- seems unavoidably to lead to large constant overhead.  

Which brings us to our new work (this is joint work with Christopher Jung, Katrina Ligett, Seth Neel, Saeed Sharifi-Malvajerdi, and Moshe Shenfeld). We give a brand new proof of the transfer theorem. It is elementary, and in particular, obtains high probability generalization bounds directly from high probability sample-accuracy bounds, avoiding the need for the monitor argument. I think the proof is the most interesting part --- it's simple and (I think) illuminating --- but an upshot is that we get substantially better bounds, even though the improvement is just in the constants (the existing BNSSSU bound is known to be asymptotically tight). Here's the plot with our bound included --- the x-axis is the same, but note the substantially scaled-up y axis: 



Proof Sketch
Ok: on to the proof. Here is the trick. In actuality, the dataset S is first sampled from $\mathcal{P}^n$, and then some data analyst interacts with a differentially private statistical estimator, resulting in some transcript $\pi$ of query-answer pairs. But now imagine that after the interaction is complete, S is resampled from $Q_\pi = (\mathcal{P}^n)|\pi$, the posterior distribution on datasets conditioned on $\pi$. If you reflect on this for a moment, you'll notice that this resampling experiment doesn't change the joint distribution on dataset/transcript pairs $(S,\pi)$ at all. So if the mechanism promised high probability sample accuracy bounds, it still promises them in this resampling experiment. But let's think about what that means: the mechanism can first commit to some set of answers $a_i$, and promise that with high probability, after S is resampled from $Q_\pi$, $|a_i - \frac{1}{n}\sum_{j=1}^n q_i(S_j)|$ is small. But under the resampling experiment, it is quite likely that the empirical value of the query $\frac{1}{n}\sum_{j=1}^n q_i(S_j)$ will end up being close to its expectation over the posterior: $q_i(Q_{\pi}) = \mathrm{E}_{S \sim Q_{\pi}}[\frac{1}{n}\sum_{j=1}^n q_i(S_j)]$. So the only way that a mechanism can promise high probability sample accuracy is if it actually promises high probability posterior accuracy: i.e. with high probability, for every query $q_i$ that was asked and answered, we must have that $|a_i - q_i(Q_\pi)|$ is small.

That part of the argument was generic --- it didn't use differential privacy at all! But it serves to focus attention on these posterior distributions $Q_\pi$ that our statistical estimator induces. And it turns out it's not hard to see that the expected values of queries on posteriors induced by differentially private mechanisms have to be close to their true answers. For $(\epsilon,0)$-differential privacy, it follows almost immediately from the definition. Here is the derivation. Pick your favorite query $q$ and your favorite transcript $\pi$, and write $S_j \sim S$ to denote a uniformly randomly selected element of a dataset $S$:
$$q (Q_\pi) =  \sum_{x} q (x) \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x | \pi]= \sum_{x} q (x) \cdot \frac{\Pr [\pi | S_j = x ] \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x]}{\Pr[\pi]}$$
$$\leq \sum_{x} q (x) \cdot \frac{e^\epsilon \Pr [\pi] \cdot \Pr_{S_j \sim \mathcal{P}} [S_j = x]}{\Pr[\pi]}
= e^\epsilon \cdot q (\mathcal{P})$$

Here, the inequality follows from the definition of differential privacy, which controls the effect that fixing a single element of the dataset to any value $(S_j = x)$ can have on the probability of any transcript: it can increase it multiplicatively by a factor of at most $e^\epsilon$. 

And that's it: so we must have that (with probability 1!), $|q(Q_\pi) - q(\mathcal{P})| \leq e^\epsilon-1 \approx \epsilon$. The transfer theorem then follows from the triangle inequality. We get a high probability bound for free, with no need for any heavy machinery.

The argument is just a little more delicate in the case of $(\epsilon,\delta)$-differential privacy, and can be extended beyond linear queries --- but I think this gives the basic idea. The details are in our new "JLNRSS" paper. Incidentally, one nice thing about having many different proofs of the same theorem is that you can start to see some commonalities. One seems to be: it takes six authors to prove a transfer theorem!


Tuesday, May 28, 2019

Individual Notions of Fairness You Can Use


Our group at Penn has been thinking for a while about when individual notions of fairness might be practically achievable, and we have two new approaches.

Background:
Statistical Fairness
I've written about this before, here. But briefly: there are two families of definitions in the fairness in machine learning literature. The first group of definitions I call statistical fairness notions. If you want to come up with your own statistical fairness notion, you can follow this recipe:
  1. Partition the world into a small number of "protected sub-groups". You will probably be thinking along the lines of race or gender or something similar when you do this.
  2. Pick your favorite error/accuracy metric for a classifier. This might literally be classification error, or false positive or false negative rate, or positive predictive value, or something else. Lots of options here. 
  3. Ask that this metric be approximately equalized across your protected groups.
  4. Finally, enjoy your new statistical fairness measure! Congratulations!
These definitions are far and away the most popular in this literature, in large part (I think) because they are so immediately actionable. Because they are defined as conditions on a small number of expectations, you can easily check whether your classifier is "fair" according to these metrics, and (although there are some interesting computational challenges) go and try to learn classifiers subject to these constraints. 
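To make "immediately actionable" concrete, here is a small sketch (with made-up function names) of auditing one such notion, equalized false positive rates:

```python
# A sketch of auditing one statistical fairness notion: (approximately)
# equalized false positive rates across protected groups.
import numpy as np

def false_positive_rates(y_true, y_pred, group):
    """y_true, y_pred in {0,1}; group: array of group labels. Returns the FPR
    of the predictions within each group."""
    rates = {}
    for g in np.unique(group):
        negatives = (group == g) & (y_true == 0)
        rates[g] = y_pred[negatives].mean() if negatives.any() else float("nan")
    return rates

# e.g. declare the classifier "fair" under this notion if
# max(rates.values()) - min(rates.values()) is below some tolerance.
```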

Their major problem is related to the reason for their success: they are defined as conditions on a small number of expectations or averages over people, and so they don't promise much to particular individuals. I'll borrow an example from our fairness gerrymandering paper from a few years ago to put this in sharp relief. Imagine that we are building a system to decide who to incarcerate, and we want to be "fair" with respect to both gender (men and women) and race (green and blue people). We decide that in our scenario, it is the false positives who are harmed (innocent people sent to jail), and so to be fair, we decide we should equalize the false positive rate: across men and women, and across greens and blues. But one way to do this is to jail all green men and blue women. This does indeed equalize the false positive rate (at 50%) across all four of the groups we specified, but it is cold comfort if you happen to be a green man --- since then you will be jailed with certainty. The problem is that our fairness constraint was never a promise to an individual to begin with, just a promise about the average behavior of our classifier over a large group. And although this is a toy example constructed to make a point, things like this happen in real data too. 

Individual Fairness
Individual notions of fairness, on the other hand, really do correspond to promises made to individuals. There are at least two kinds of individual fairness definitions that have been proposed: metric fairness, and weakly meritocratic fairness. Metric fairness proposes that the learner will be handed a task-specific similarity metric, and requires that individuals who are close together in the metric should have a similar probability of being classified as positive. Weakly meritocratic fairness, on the other hand, takes the (unknown) labels of an individual as a measure of merit, and requires that individuals who have a higher probability of really having a positive label should have at least as high a probability of being classified as positive. This in particular implies that false positive and false negative rates should be equalized across individuals, where now the word "rate" averages over only the randomness of the classifier, not over people. What makes both of these notions individual is that they impose constraints that bind on all pairs of individuals, not just on averages over people.
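For concreteness, here is the pairwise check that metric fairness imposes, written as code (just an illustration of the definition; the metric $d$ and the Lipschitz constant are hypothetical inputs):

```python
# The metric-fairness (Lipschitz) condition, viewed as a pairwise check.
# p_x, p_y are the probabilities of positive classification for two people,
# and d_xy is their distance under the (hypothetical) task-specific metric.
def metric_fairness_violated(p_x, p_y, d_xy, L=1.0):
    return abs(p_x - p_y) > L * d_xy
```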

Definitions like this have the advantage of strong individual-level semantics, which the statistical definitions don't have. But they also have big problems. For metric fairness, the obvious question is: where does the metric come from? Even granting that fairness should be some Lipschitz condition on a metric, it seems hard to pin down what the metric is, and different people will disagree: coming up with the metric seems to encapsulate a large part of the original problem of defining fairness. For weakly meritocratic fairness, the obvious problem is that we don't know what the labels are. It's possible to do non-trivial things if you make assumptions about the label generating process, but it's not at all clear you can do any non-trivial learning subject to this constraint if you don't make strong assumptions.

Two New Approaches:
We have two new approaches, building off of metric fairness and weakly meritocratic fairness respectively. Both have the advantages of statistical notions of fairness in that they can be put into practice without making unrealistic assumptions about the data, and without needing to wait on someone to hand us a metric. But they continue to make meaningful promises to individuals.

Subjective Individual Fairness
Let's start with our variant of metric fairness, which we call subjective individual fairness. (This is joint work with Michael Kearns, our PhD students Chris Jung and Seth Neel, our former PhD student Steven Wu, and Steven's student (our grand-student!) Logan Stapleton.) The paper is here: https://arxiv.org/abs/1905.10660. We stick with the premise that "similar people should be treated similarly", and that whether or not it is correct/just/etc., it is at least fair to treat two people the same way, in the sense that we classify them as positive with the same probability. But we don't want to assume anything else.

Suppose I were to create a machine learning fairness panel: I could recruit "AI Ethics" experts, moral philosophers, hyped up consultants, people off the street, toddlers, etc. I would expect that there would be as many different conceptions of fairness as there were people on the panel, and that none of them could precisely quantify what they meant by fairness --- certainly not in the form of a "fairness metric". But I could still ask these people, in particular cases, if they thought it was fair that two particular individuals be treated differently or not.

Of course, I would have no reason to expect that the responses that I got from the different panelists would be consistent with one another --- or possibly even internally consistent (we won't assume, e.g. that the responses satisfy any kind of triangle inequality). Nevertheless, once we fix a data distribution and a group of people who have opinions about fairness, we have a well defined tradeoff we can hope to manage: any classifier we could choose will have both:
  1. Some error rate, and
  2. Some frequency with which it makes a pair of decisions that someone in the group finds unfair. 
We can hope to find classifiers that optimally trade off 1 and 2: note that this is a coherent tradeoff even though we haven't forced the people to express their conceptions of fairness in some consistent metric. What we show is that you can do this. 

Specifically, given a set of pairs that we have determined should be treated similarly, there is an oracle-efficient algorithm that can find the optimal classifier subject to the constraint that no pair of individuals that has been specified as a constraint should have a substantially different probability of positive classification. Oracle efficiency means that we can reduce the "fair learning" problem to a regular old learning problem, without fairness constraints. If we can solve the regular learning problem, we can also solve the fair learning problem. This kind of fairness constraint also generalizes in the standard way: if you ask your fairness panel about a reasonably small number of pairs, and then solve the in-sample problem subject to these constraints, the classifier you learn will also satisfy the fairness constraints out of sample. And it works: we implement the algorithm and try it out on the COMPAS data set, with fairness constraints that we elicited from 43 human (undergrad) subjects. The interesting thing is that once you have an algorithm like this, it isn't only a tool to create "fair" machine learning models: it's also a new instrument to investigate human conceptions of fairness. We already see quite a bit of variation among our 43 subjects in our preliminary experiments. We plan to pursue this direction more going forward.
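To give a flavor of the underlying optimization problem, here is a toy stand-in (emphatically not the oracle-efficient algorithm from the paper): a penalized logistic regression that pushes each elicited pair toward similar treatment.

```python
# A toy stand-in for the constrained problem (NOT the paper's oracle-efficient
# algorithm): logistic regression with a penalty pushing each elicited pair
# (i, j) toward similar probabilities of positive classification.
import numpy as np

def fit_with_pair_penalty(X, y, pairs, lam=10.0, lr=0.1, steps=2000):
    """X: (n, d) features; y: (n,) labels in {0,1};
    pairs: list of (i, j) index pairs the panel said should be treated alike."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))          # predicted probabilities
        grad = X.T @ (p - y) / len(y)             # gradient of logistic loss
        for i, j in pairs:                        # gradient of the penalty
            diff = p[i] - p[j]                    # (lam/2)*avg of (p_i - p_j)^2
            grad += lam * diff * (p[i] * (1 - p[i]) * X[i]
                                  - p[j] * (1 - p[j]) * X[j]) / max(len(pairs), 1)
        w -= lr * grad
    return w
```

The real algorithm enforces the constraints via the reduction to unconstrained learning described above; this sketch is only meant to show what tradeoff is being managed.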

Average Individual Fairness
Next, our variant of weakly meritocratic fairness. This is joint work with Michael Kearns and our student Saeed Sharifi-Malvajerdi. The paper is here: https://arxiv.org/abs/1905.10607. In certain scenarios, it really does seem tempting to think about fairness in terms of false positive rates. Criminal justice is a great example, in the sense that it is clear that everyone agrees on which outcome they want (they would like to be released from jail), and so the people we are being unfair to really do seem to be the false positives: the people who should have been released from jail, but who were mistakenly incarcerated for longer. So in our "fairness gerrymandering" example above, maybe the problem with thinking about false positive rates wasn't a problem with false positives, but with rates: i.e. the problem was that the word "rate" averaged over many people, and so it didn't promise you anything. Our idea is to redefine the word "rate". 

In some (but certainly not all) settings, people are subject to not just one, but many classification tasks. For example, consider online advertising: you might be shown thousands of targeted ads each month. Or applying for schools (a process that is centralized in cities like New York): you apply not just to one school, but to many. In situations like this, we can model the fact that we have not just a distribution over people, but also a distribution over (or collection of) problems. 

Once we have a distribution over problems, we can define the error rate, or false positive rate, or any other rate you like for individuals. It is now sensible to talk about Alice's false positive rate, or Bob's error rate, because rate has been redefined as an average over problems, for a particular individual. So we can now ask for individual fairness notions in the spirit of the statistical notions of fairness we discussed above! We no longer need to define protected groups: we can now ask that the false positive rates, or error rates, be equalized across all pairs of people. 
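Concretely (illustrative code, not from the paper), if we record labels and predictions for the same people across many problems, the per-person rates are just averages down the problem axis:

```python
# Per-individual "rates", averaged over problems rather than over people.
# Rows index problems, columns index people.
import numpy as np

def individual_error_rates(y_true, y_pred):
    """y_true, y_pred: (num_problems, num_people) arrays of 0/1 labels and
    predictions. Returns each person's error rate across the problems."""
    return (y_true != y_pred).mean(axis=0)

def individual_false_positive_rates(y_true, y_pred):
    """Fraction of each person's true-negative problems that were labeled 1."""
    negatives = (y_true == 0)
    return np.where(negatives, y_pred, 0).sum(axis=0) / negatives.sum(axis=0)

# The fairness requirement is then that these per-person rates be
# (approximately) equal across all pairs of people.
```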

It turns out that given a reasonably sized sample of people, and a reasonably sized sample of problems, it is tractable to find the optimal classifier subject to constraints like this in sample, and that these guarantees generalize out of sample. The in-sample algorithm is again an oracle-efficient algorithm, or in other words, a reduction to standard, unconstrained learning. The generalization guarantee here is a little interesting, because now we are talking about simultaneous generalization in two different directions: to people we haven't seen before, and also to problems we haven't seen before. This requires thinking a little bit about what kind of object we are even trying to output: a mapping from new problems to classifiers. The details are in the paper (spoiler --- the mapping is defined by the optimal dual variables for the empirical risk minimization problem): here, I'll just point out that again, the algorithm is practical to implement, and we perform some simple experiments with it. 





Tuesday, April 09, 2019

The Ethical Algorithm

I've had the good fortune to be able to work on a number of research topics so far, including privacy, fairness, algorithmic game theory, and adaptive data analysis, and the relationship between all of these things and machine learning. As academics, we do a lot of writing about the things we work on, but usually our audience is narrow and technical: other researchers in our sub-specialty. But it can be both fun and important to communicate to a wider audience as well. So my amazing colleague Michael Kearns and I wrote a book, called The Ethical Algorithm. It's coming out in October (Amazon says the release date is November 1st, but my understanding is that pre-orders will start shipping on October 4).

This was the first time for either of us writing something like this: it's not a textbook, it's a "trade book" -- a popular science book. Its intended readership isn't just computer science PhDs, but the educated public broadly. But there should be plenty in it to interest experts too, because we cover quite a bit of ground. The book is about the problems that arise when algorithmic decision making interacts with human beings --- and the emerging science about how to fix them. The topics we cover include privacy, fairness, strategic interactions and gaming, and the scientific reproducibility crisis.

Lots has been written about the problems that can arise, especially related to privacy and fairness. And we're not trying to reinvent the wheel. Instead, the focus of our book is the exciting and embryonic algorithmic science that has grown to address these issues.

The privacy chapter develops differential privacy, and its strengths and weaknesses. The fairness chapter covers recent work on algorithmic fairness that has come out of the computer science literature. The gaming chapter studies how algorithm design can affect the equilibria that emerge from large scale interactions. The reproducibility chapter explores the underlying issues that lead to false discovery, and recent algorithmic advances that hold promise in avoiding them. We try to set expectations appropriately. We don't pretend that the solutions to complex societal problems can be entirely (or even primarily) algorithmic. But we argue that embedding social values into algorithms will inevitably form an important component of any solution.

I'm really excited for it to come out. If you want, you can pre-order it now, either at Amazon: https://www.amazon.com/Ethical-Algorithm-Science-Socially-Design/dp/0190948205 or directly from the publisher: https://global.oup.com/academic/product/the-ethical-algorithm-9780190948207?cc=us&lang=en&#.XKegC9eWi2w.twitter

Tuesday, February 26, 2019

Impossibility Results in Fairness as Bayesian Inference

One of the most striking results about fairness in machine learning is the impossibility result that Alexandra Chouldechova, and separately Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan discovered a few years ago. These papers say something very crisp. I'll focus here on the binary classification setting that Alex studies because it is much simpler. There are (at least) three reasonable properties you would want your "fair" classifiers to have. They are:
  1. False Positive Rate Balance: The rate at which your classifier makes errors in the positive direction (i.e. labels negative examples positive) should be the same across groups.
  2. False Negative Rate Balance:  The rate at which your classifier makes errors in the negative direction (i.e. labels positive examples negative) should be the same across groups.
  3. Predictive Parity: The statistical "meaning" of a positive classification should be the same across groups (we'll be more specific about what this means in a moment).
What Chouldechova and KMR show is that if you want all three, you are out of luck --- unless you are in one of two very unlikely situations: Either you have a perfect classifier that never errs, or the base rate is exactly the same for both populations --- i.e. both populations have exactly the same frequency of positive examples. If you don't find yourself in one of these two unusual situations, then you have to give up on properties 1, 2, or 3. 

This is discouraging, because there are good reasons to want each of properties 1, 2, and 3. And these aren't measures made up in order to formulate an impossibility result --- they have their roots in the ProPublica/COMPAS controversy. Roughly speaking, ProPublica discovered that the COMPAS recidivism prediction algorithm violated false positive and negative rate balance, and they took the position that this made the classifier unfair. Northpointe (the creators of the COMPAS algorithm) responded by saying that their algorithm satisfied predictive parity, and took the position that this made the classifier fair. The two sides were seemingly talking past each other by using two different definitions of what "fair" should mean. What the impossibility result says is that there is no way to satisfy both sides of this debate. 

So why is this result true? The proof in Alex's paper can't be made simpler --- it's already a one-liner, following from an algebraic identity. But the first time I read it I didn't have a great intuition for why it held. Viewing the statement through the lens of Bayesian inference made the result very intuitive (at least for me). With this viewpoint, all the impossibility result is saying is: "If you have different priors about some event (say that a released inmate will go on to commit a crime) for two different populations, and you receive evidence of the same strength for both populations, then you will have different posteriors as well". This is now bordering on obvious --- because your posterior belief about an event is a combination of your prior belief and the new evidence you have received, weighted by the strength of that evidence.  

Let's walk through this. Suppose we have two populations, call them $A$s and $B$s. Individuals $x$ from these populations have some true binary label $\ell(x) \in \{0,1\}$ which we are trying to predict. Individuals from the two populations are drawn from different distributions, which we'll call $D_A$ and $D_B$. We have some classifier that predicts labels $\hat\ell(x)$, and we would like it to satisfy the three fairness criteria defined above. First, let's define some terms:

The base rate for a population $i$ is just the frequency of positive labels:
$$p_i = \Pr_{x \sim D_i}[\ell(x) = 1].$$
The false positive and false negative rates of the classifier are:
$$FPR_i = \Pr_{x \sim D_i}[\hat\ell(x) = 1 | \ell(x) = 0] \ \ \ \ FNR_i = \Pr_{x \sim D_i}[\hat\ell(x) = 0 | \ell(x) = 1].$$
And the positive predictive value of the classifier is:
$$PPV_i = \Pr_{x \sim D_i}[\ell(x) = 1 | \hat\ell(x)=1].$$
Satisfying all three fairness constraints just means finding a classifier such that $FPR_A = FPR_B$, $FNR_A = FNR_B$, and $PPV_A = PPV_B$.

How should we prove that this is impossible? All three of these quantities are conditional probabilities, so we are essentially obligated to apply Bayes Rule:
$$PPV_i =  \Pr_{x \sim D_i}[\ell(x) = 1 | \hat\ell(x)=1] = \frac{ \Pr_{x \sim D_i}[\hat\ell(x)=1 | \ell(x) = 1]\cdot \Pr_{x \sim D_i} [\ell(x) = 1]}{ \Pr_{x \sim D_i}[\hat \ell(x) = 1]}$$
But now these quantities on the right hand side are things we have names for. Substituting in, we get:
$$PPV_i  = \frac{p_i(1-FNR_i)}{p_i(1-FNR_i) + (1-p_i)FPR_i}$$

And so now we see the problem. Suppose we have $FNR_A = FNR_B$ and $FPR_A = FPR_B$. Can we have $PPV_A = PPV_B$? There are only two ways. If $p_A = p_B$, then we are done, because the right hand side is the same for either $i \in \{A,B\}$. But if the base rates are different, then the only way to make these two quantities equal is if $FNR_i = FPR_i = 0$ --- i.e. if our classifier is perfect.
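A quick numeric illustration of that last step, with made-up numbers:

```python
# Plugging made-up numbers into the identity above: with FNR and FPR held
# equal across groups, different base rates force different PPVs.
def ppv(p, fnr, fpr):
    return p * (1 - fnr) / (p * (1 - fnr) + (1 - p) * fpr)

fnr = fpr = 0.2
print(ppv(0.3, fnr, fpr))   # base rate 0.3 -> PPV ~ 0.63
print(ppv(0.5, fnr, fpr))   # base rate 0.5 -> PPV = 0.80
```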

The piece of intuition here is that the base rate is our prior belief that $\ell(x) = 1$, before we see the output of the classifier. The positive predictive value is our posterior belief that $\ell(x) = 1$, after we see the output of the classifier. And all we need to know about the classifier in order to apply Bayes rule to derive our posterior from our prior is its false positive rate and its false negative rate --- these fully characterize the "strength of the evidence." Hence: "If our prior probabilities differ, and we see evidence of a positive label of the same strength, then our posterior probabilities will differ as well."

Once you realize this, then you can generalize the fairness impossibility result to other settings by making equally obvious statements about probability elsewhere. :-)

For example, suppose we generalize the labels to be real valued instead of binary --- so when making decisions, we can model individuals using shades of gray. (e.g. in college admissions, we don't have to model individuals as "qualified" or not, but rather can model talent as a real value.) Let's fix a model for concreteness, but the particulars are not important. (The model here is related to my paper with Sampath Kannan and Juba Ziani on the downstream effects of affirmative action.)

Suppose that in population $i$, labels are distributed according to a Gaussian distribution with mean $\mu_i$: $\ell(x) \sim N(\mu_i, 1)$. For an individual from group $i$, we have a test that gives an unbiased estimator of their label, with some standard deviation $\sigma_i$: $\hat \ell(x) \sim N(\ell(x), \sigma_i)$.

In a model like this, we have analogues of our fairness desiderata in the binary case:

  • Analogue of Error Rate Balance: We would like our test to be equally informative about both populations: $\sigma_A = \sigma_B$. 
  • Analogue of Predictive Parity: Any test score $t$ should induce the same posterior expectation on true labels across populations: $$E_{D_A}[\ell(x) | \hat \ell(x) = t] = E_{D_B}[\ell(x) | \hat \ell(x) = t]$$ 
Can we satisfy both of these conditions at the same time? Because the normal distribution is self-conjugate (that's why we chose it!), Bayes Rule simplifies to have a nice closed form, and we can compute our posteriors as follows:
$$E_{D_i}[\ell(x) | \hat \ell(x) = t] = \frac{\sigma_i^2}{\sigma_i^2 + 1}\cdot \mu_i + \frac{1}{\sigma_i^2 + 1}\cdot t$$
So there are only two ways we can achieve both properties:
  1. We can of course satisfy both conditions if the prior distributions are the same for both groups: $\mu_A = \mu_B$. Then we can set $\sigma_A = \sigma_B$ and observe that the right hand side of the above expression is identical for $i \in \{A, B\}$.
  2. We can also satisfy both conditions if the prior means are different, but the signal is perfect: i.e. $\sigma_A = \sigma_B = 0$. (Then both posterior means are just $t$, independent of the prior means). 
But we can see by inspection that these are the only two cases. If $\sigma_A = \sigma_B$, but the prior means are different, then the posterior means will be different for every $t$. This is really the same impossibility result as in the binary case: all it is saying is that if I have different priors about different groups, but the evidence I receive has the same strength, then my posteriors will also be different.
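And the same numeric sanity check in the Gaussian model (again with made-up numbers):

```python
# The Gaussian version of the same point: equal test noise sigma but different
# prior means give different posterior means at every score t.
def posterior_mean(mu, sigma, t):
    return (sigma**2 / (sigma**2 + 1)) * mu + (1 / (sigma**2 + 1)) * t

sigma, t = 1.0, 115.0
print(posterior_mean(100.0, sigma, t))   # prior mean 100 -> posterior 107.5
print(posterior_mean(105.0, sigma, t))   # prior mean 105 -> posterior 110.0
```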

So the mathematical fact is simple --- but its implications remain deep. It means we have to choose between equalizing a decision maker's posterior about the label of an individual, or providing an equally accurate signal about each individual, and that we cannot have both. Unfortunately, living without either one of these conditions can lead to real harm.


Saturday, January 26, 2019

Algorithmic Unfairness Without Any Bias Baked In

Discussion of (un)fairness in machine learning hit mainstream political discourse this week, when Representative Alexandria Ocasio-Cortez discussed the possibility of algorithmic bias, and was clumsily "called out" by Ryan Saavedra on Twitter.
It was gratifying to see the number of responses pointing out how wrong he was --- awareness of algorithmic bias has clearly become pervasive! But most of the pushback focused on the possibility of bias being "baked in" by the designer of the algorithm, or because of latent bias embedded in the data, or both.
Bias in the data is certainly a problem, especially when labels are gathered by human beings. But it's far from the only problem. In this post, I want to walk through a very simple example in which the algorithm designer is being entirely reasonable, there are no human beings injecting bias into the labels, and yet the resulting outcome is "unfair". Here is the (toy) scenario -- the specifics aren't important. High school students are applying to college, and each student has some innate "talent" $I$, which we will imagine is normally distributed, with mean 100 and standard deviation 15: $I \sim N(100,15)$. The college would like to admit students who are sufficiently talented --- say one standard deviation above the mean (so, it would like to admit students with $I \geq 115$). The problem is that talent isn't directly observable. Instead, the college can observe grades $g$ and SAT scores $s$, which are noisy estimates of talent. For simplicity, let's imagine that both grades and SAT scores are independently and normally distributed, centered at a student's talent level, and also with standard deviation 15: $g \sim N(I, 15)$, $s \sim N(I, 15)$.

In this scenario, the college has a simple, optimal decision rule: It should run a linear regression to try and predict student talent from grades and SAT scores, and then it should admit the students whose predicted talent is at least 115. This is indeed "driven by math" --- since we assumed everything was normally distributed here, this turns out to correspond to the Bayesian optimal decision rule for the college.

Ok. Now let's suppose there are two populations of students, which we will call Reds and Blues. Reds are the majority population, and Blues are a small minority population --- the Blues only make up about 1% of the student body. But the Reds and the Blues are no different when it comes to talent: they both have the same talent distribution, as described above. And there is no bias baked into the grading or the exams: both the Reds and the Blues also have exactly the same grade and exam score distributions, as described above.

But there is one difference: the Blues have a bit more money than the Reds, so they each take the SAT twice, and report only the highest of the two scores to the college. This results in a small but noticeable bump in their average SAT scores, compared to the Reds. Here are the grades and exam scores for the two populations, plotted:
So what is the effect of this when we use our reasonable inference procedure? First, let's consider what happens when we learn two different regression models: one for the Blues, and a different one for the Reds. We don't see much difference:

The Red classifier makes errors approximately 11% of the time. The Blue classifier does about the same --- it makes errors about 10.4% of the time. This makes sense: the Blues artificially inflated their SAT score distribution without increasing their talent, and the classifier picked up on this and corrected for it. In fact, it is even a little more accurate!

And since we are interested in fairness, let's think about the false negative rate of our classifiers. "False Negatives" in this setting are the people who are qualified to attend the college ($I \geq 115$), but whom the college mistakenly rejects. These are really the people who come to harm as a result of the classifier's mistakes. And the False Negative Rate is the probability that a randomly selected qualified person is mistakenly rejected from college --- i.e. the probability that a randomly selected qualified student is harmed by the classifier. We should want the false negative rates to be approximately equal across the two populations: this would mean that the burden of harm caused by the classifier's mistakes is not disproportionately borne by one population over the other. This is one reason why the difference between false negative rates across different populations has become a standard fairness metric in algorithmic fairness --- sometimes referred to as "equal opportunity."

So how do we fare on this metric? Not so badly! The Blue model has a false negative rate of 50% on the blues, and the Red model has a false negative rate of 47% on the reds --- so the difference between these two is a satisfyingly small 3%.

But you might reasonably object: because we have learned separate models for the Blues and the Reds, we are explicitly making admissions decisions as a function of a student's color! This might sound like a form of discrimination, baked in by the algorithm designer --- and if the two populations represent e.g. racial groups, then it's explicitly illegal in a number of settings, including lending.

So what happens if we don't allow our classifier to see group membership, and just train one classifier on the whole student body? The gap in false negative rates between the two populations balloons to 12.5%, and the overall error rate ticks up. This means if you are a qualified member of the Red population, you are substantially more likely to be mistakenly rejected by our classifier than if you are a qualified member of the Blue population.

What happened? There wasn't any malice anywhere in this data pipeline. It's just that the Red population was much larger than the Blue population, so when we trained a classifier to minimize its average error over the entire student body, it naturally fit the Red population --- which contributed much more to the average. But this means that the classifier was no longer compensating for the artificially inflated SAT scores of the Blues, and so it was making a disproportionate number of errors on them --- all in their favor.

The combined admissions rule takes everyone above the black line. Since the Blues are shifted up relative to the Reds, they are admitted at a disproportionately higher rate. 
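For anyone who wants to poke at this, here is a quick self-contained reconstruction of the toy experiment (my own sketch using scikit-learn; the exact percentages will differ a bit from the numbers quoted above, but the qualitative gap is the point):

```python
# A quick reconstruction of the toy experiment: separate vs. pooled regression
# models, comparing false negative rate gaps between the two populations.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def make_group(n, retakes_sat):
    talent = rng.normal(100, 15, n)
    grades = rng.normal(talent, 15)
    sat = rng.normal(talent, 15)
    if retakes_sat:                      # Blues take the SAT twice, keep the max
        sat = np.maximum(sat, rng.normal(talent, 15))
    return np.column_stack([grades, sat]), talent

X_red, t_red = make_group(99_000, retakes_sat=False)    # majority population
X_blue, t_blue = make_group(1_000, retakes_sat=True)    # ~1% minority

def fnr(model, X, talent):
    """Fraction of qualified students (talent >= 115) the model rejects."""
    qualified = talent >= 115
    rejected = model.predict(X) < 115
    return (qualified & rejected).sum() / qualified.sum()

# Separate models for each group:
m_red = LinearRegression().fit(X_red, t_red)
m_blue = LinearRegression().fit(X_blue, t_blue)
print("separate models, FNR gap:",
      abs(fnr(m_red, X_red, t_red) - fnr(m_blue, X_blue, t_blue)))

# One pooled, group-blind model:
X_all = np.vstack([X_red, X_blue])
t_all = np.concatenate([t_red, t_blue])
m_all = LinearRegression().fit(X_all, t_all)
print("pooled model, FNR gap:",
      abs(fnr(m_all, X_red, t_red) - fnr(m_all, X_blue, t_blue)))
```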


This is the kind of thing that happens all the time: whenever there are two populations that have different feature distributions, learning a single classifier (that is prohibited from discriminating based on population) will fit the bigger of the two populations, simply because they contribute more to average error. Depending on the nature of the distribution difference, this can be either to the benefit or the detriment of the minority population. And not only does this not involve any explicit human bias, either on the part of the algorithm designer or the data gathering process, the problem is exacerbated if we artificially force the algorithm to be group blind. Well intentioned "fairness" regulations prohibiting decision makers from taking sensitive attributes into account can actually make things less fair and less accurate at the same time.




Thursday, January 10, 2019

2019 SIGecom Dissertation Award: Call for Nominations

Dear all,

Please consider nominating graduating Ph.D. students for the SIGecom Dissertation Award.  If you are a graduating student, consider asking your adviser or other senior mentor to nominate you.

Nominations are due on February 28, 2019.  This award is given to a student who defended a thesis in 2018.  It is a prestigious award and is accompanied by a $1500 prize.  In the past, the grand prize has been awarded to:

2017: Aviad Rubinstein, "Hardness of Approximation Between P and NP"
2016: Peng Shi, "Prediction and Optimization in School Choice"
2015: Inbal Talgam-Cohen, "Robust Market Design: Information and Computation"
2014: S. Matthew Weinberg, "Algorithms for Strategic Agents"
2013: Balasubramanian Sivan, "Prior Robust Optimization"


And the award has had seven runners-up: Rachel Cummings, Christos Tzamos, Bo Waggoner, James Wright, Xi (Alice) Gao, Yang Cai, and Sigal Oren.  You can find detailed information about the nomination process at: http://www.sigecom.org/awardd.html. We look forward to reading your nominations!


Your Award Committee,

Renato Paes Leme
Aaron Roth (Chair)
Inbal Talgam-Cohen