Tuesday, September 01, 2020

No Regret Algorithms from the Min Max Theorem

The existence of no-regret learning algorithms can be used to prove von Neumann's min-max theorem. This argument is originally due to Freund and Schapire, and I teach it to my undergraduates in my algorithmic game theory class. The min-max theorem can also be used to prove the existence of no-regret learning algorithms. Here is a constructive version of the argument (constructive in the sense that the resulting algorithm only requires solving polynomially sized zero-sum games, which you can do via linear programming).

Recall the setting. Play proceeds in rounds $t \in \{1,\ldots,T\}$. On each day $t$, the learner chooses one of $k$ actions $i_t \in \{1,\ldots,k\}$, and the adversary chooses a loss vector $\ell^t \in [0,1]^k$. The learner incurs loss $\ell^t_{i_t}$, corresponding to the action she chose. At the end of the interaction, the regret of the learner is defined to be the difference between the cumulative loss she incurred and the cumulative loss of the best fixed action (played consistently) in hindsight:
$$\textrm{Regret}_T = \max_j \left(\sum_{t=1}^T \left(\ell^t_{i_t} - \ell^t_j\right)\right)$$ 
A classical and remarkable result is that there exist algorithms that can guarantee that expected regret grows only sublinearly with time: $\mathbb{E}[\textrm{Regret}_T] = O(\sqrt{kT})$. Let's prove this. 
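In code, the quantity we are bounding is easy to compute from a transcript of play (a minimal sketch; the names `losses` and `actions` are just for illustration):

```python
def regret(losses, actions):
    # losses: length-T list of loss vectors in [0,1]^k; actions: learner's choices
    T, k = len(losses), len(losses[0])
    incurred = sum(losses[t][actions[t]] for t in range(T))
    best_fixed = min(sum(losses[t][j] for t in range(T)) for j in range(k))
    return incurred - best_fixed
```

Note that regret can be negative if the learner happens to beat every fixed action.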

Define the non-negative portion of our cumulative regret with respect to action $j$ up until day $d$ as:
$$V_d^j = \left(\sum_{t=1}^d\left(\ell^t_{i_t} - \ell^t_j\right)\right)^+$$
and our additional regret at day $d+1$ with respect to action $j$ as:
$$r_{j}^{d+1} = \ell_{i_{d+1}}^{d+1} - \ell^{d+1}_j$$
Observe that if $V_{d}^j \geq 1$  then $V_{d+1}^j = V_d^j + r_j^{d+1}$. 

Define a surrogate loss function as our squared cumulative regrets, summed over all actions: 
$$L_d = \sum_{j=1}^k (V_d^j)^2$$
Observe that we can bound the expected increase in our surrogate loss on day $d+1$, conditioned on the history thus far:
$$\mathbb{E}[L_{d+1} - L_d] \leq \sum_{j : V_d^j \geq 1} \mathbb{E}\left[(V_d^j+r_j^{d+1})^2 - (V_d^j)^2\right] + 3k$$
$$= \sum_{j : V_d^j \geq 1} \left(2V_d^j \mathbb{E}[r_{j}^{d+1}] + \mathbb{E}[(r_{j}^{d+1})^2]\right) + 3k $$
$$\leq \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) + 4k$$
where the expectations are taken over the randomness of both the learner and the adversary in round $d+1$. 

Now consider a zero-sum game played between the learner and the adversary, in which the learner is the minimization player, the adversary is the maximization player, and the payoff function is $$u(i_{d+1}, \ell^{d+1}) = \sum_{j=1}^k 2V_d^j r_{j}^{d+1}$$ (so that under mixed strategies, its expectation is exactly the first term of our bound above). The min-max theorem says that the learner can guarantee herself the same expected payoff in the following two scenarios:

  1. The learner first has to commit to playing a distribution $p_{d+1}$ over actions $i$, and then the adversary gets to best respond by picking the worst possible loss vectors, or
  2. The adversary has to first commit to a distribution over loss vectors $\ell$ and then the learner gets the benefit of picking the best action $i_{d+1}$ to respond with. 
Scenario 1) is the scenario our learner finds herself in when playing against an adaptive adversary. But scenario 2) is much easier to analyze. If the adversary first commits to a distribution over loss vectors $\ell^{d+1}$, the learner can always choose the action $i_{d+1} = \arg\min_j \mathbb{E}[\ell^{d+1}_j]$, which guarantees that $\mathbb{E}[r_{j}^{d+1}] \leq 0$ for every $j$, which in turn guarantees that the value of the game satisfies $\sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) \leq 0$. Hence, the min-max theorem tells us that the learner always has a distribution over actions $p_{d+1}$ that guarantees that $\mathbb{E}[L_{d+1} - L_d] \leq 4k$, even in the worst case over loss vectors. If the learner always plays according to this distribution, then by a telescoping sum (noting that $L_0 = 0$), we have that:
$$\mathbb{E}[L_T] \leq 4kT$$
We therefore have by Jensen's inequality that:
$$\mathbb{E}[\max_j V^j_T] \leq \sqrt{\mathbb{E}[\max_j (V^j_T)^2]} \leq \sqrt{\mathbb{E}\left[\sum_{j=1}^k (V^j_T)^2\right]} \leq 2\sqrt{kT}$$
Since $\textrm{Regret}_T \leq \max_j V^j_T$, this yields $\mathbb{E}[\textrm{Regret}_T] \leq 2\sqrt{kT}$, as promised.
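For this particular game, the learner's min-max strategy even has a closed form: because the payoff is linear in $\ell^{d+1}$, one can check that playing each action with probability proportional to $(V_d^j)^+$ achieves the guarantee --- this is the "regret matching" dynamic of Hart and Mas-Colell. A minimal sketch (the i.i.d. uniform adversary here is a stand-in purely for illustration; the guarantee holds against any adversary):

```python
import random

def regret_matching_probs(V):
    # play in proportion to the positive part of cumulative regret;
    # when all regrets are non-positive, any distribution is safe
    pos = [max(v, 0.0) for v in V]
    total = sum(pos)
    if total == 0:
        return [1.0 / len(V)] * len(V)
    return [v / total for v in pos]

def play(T, k, seed=0):
    rng = random.Random(seed)
    V = [0.0] * k              # cumulative regret vs. each fixed action
    learner_loss = 0.0
    action_loss = [0.0] * k
    for _ in range(T):
        p = regret_matching_probs(V)
        i = rng.choices(range(k), weights=p)[0]
        loss = [rng.random() for _ in range(k)]  # stand-in adversary
        learner_loss += loss[i]
        for j in range(k):
            action_loss[j] += loss[j]
            V[j] += loss[i] - loss[j]
    return learner_loss - min(action_loss)  # realized regret
```

With $T = 2000$ and $k = 5$, realized regret on a typical run should land well below the bound $2\sqrt{kT} = 200$.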

Wednesday, August 19, 2020

Moment Multicalibration for Uncertainty Estimation

This blog post is about a new paper that I'm excited about, which is joint work with Chris Jung, Changhwa Lee, Mallesh Pai, and Ricky Vohra. If you prefer watching talks, you can watch one I gave to the Wharton statistics department here.

Suppose you are diagnosed with hypertension, and your doctor recommends that you take a certain drug to lower your blood pressure. The latest research, she tells you, finds that the drug lowers diastolic blood pressure by an average of 10 mm Hg. You remember your statistics class from college, and so you ask about confidence intervals. She looks up the paper, and tells you that it reports a 95% confidence interval of [5, 15]. How should you interpret this? 

What you might naively hope is that [5, 15] represents a conditional prediction interval. If you have some set of observable features $x$, and a label $y$ (in this case corresponding to your decrease in diastolic blood pressure after taking the drug), a 95% conditional prediction interval would promise that:
$$\Pr_y [y \in [5, 15] | x] \geq 0.95$$

In other words, a conditional prediction interval would promise that given all of your observed features, over the unrealized/unmeasured randomness of the world, there is a 95% chance that your diastolic blood pressure will decrease by between 5 and 15 points. 

But if you think about it, coming up with a conditional prediction interval is essentially impossible in a rich feature space. If $x$ contains lots of information about you, then probably there was nobody in the original study population that exactly matched your set of features $x$, and so we have no information at all about the conditional distribution on $y$ given $x$ --- i.e. no samples at all from the distribution over which our coverage probability supposedly holds! So how can you expect any sort of promise at all? There are two typical ways around this difficulty. 

The first is to make heroic assumptions about the data generation process. For example, if we assume that the world looks like an ordinary least squares model, and that there is a linear relationship between $y$ and $x$, then we can form a confidence region around the parameters of the model, and from that derive prediction intervals. But these prediction intervals are not valid if the model fails to hold, which it inevitably will. 

The second is to give up on conditional prediction intervals, and instead give marginal prediction intervals. This is what the conformal prediction literature aims to do. A marginal prediction interval looks quite similar to a conditional prediction interval (at least syntactically), and promises:
$$\Pr_{(x,y)} [y \in [5, 15] ] \geq 0.95$$

Rather than conditioning on your features $x$, a marginal prediction interval averages over all people, and promises that 95% of people who take the drug have their diastolic blood pressure lowered by between 5 and 15 points. But the semantics of this promise are quite different from those of a conditional prediction interval. Because the average is now taken over a large, heterogeneous population, very little is promised to you. For example, it might be that for patients in your demographic group (e.g. middle-aged women with Sephardic Jewish ancestry and a family history of diabetes), the drug is actually expected to raise blood pressure rather than lower it. Because this subgroup represents less than 5% of the population, it is entirely consistent with the marginal prediction interval being correct. Of course, if you are lucky, then perhaps someone has conducted a study of people from this demographic group and has computed marginal prediction intervals over it! But what if there are multiple different groups that you are a member of, over which the results seem to conflict? For example, you might also have a low BMI value and have unusually good cholesterol readings --- features of a group for which the drug works unusually well. Which uncertainty estimate should you trust, if you are a member of both groups? 
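The gap between the two guarantees is easy to see numerically. A toy sketch (the numbers here are made up purely for illustration): an interval can cover 95% of the population marginally while covering nobody in a small subgroup:

```python
def coverage(ys, lo=5, hi=15):
    # fraction of realized outcomes that land inside the interval [lo, hi]
    return sum(lo <= y <= hi for y in ys) / len(ys)

# hypothetical outcomes: 97 patients whose pressure dropped 10 points, and 3
# patients from a small subgroup whose pressure instead *rose* by 3 points
majority, subgroup = [10] * 97, [-3] * 3
overall_coverage = coverage(majority + subgroup)  # marginal promise holds
subgroup_coverage = coverage(subgroup)            # promise says nothing here
```

Here the marginal coverage is 0.97, comfortably above 0.95, even though the interval covers no one in the subgroup.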

These concerns actually arise already when we think about the semantics of mean estimation ("the expected drop in blood pressure amongst patients who take this drug is 10 mm Hg"). Ideally, if you were a patient with features $x$, then 10 would be an estimate of $\mathbb{E}[y | x]$. But just as with uncertainty estimation, in a large feature space we typically have no information about the distribution on $y$ conditional on $x$ (because we have never met anyone exactly like you before), and so instead what we have is just an estimate of $\mathbb{E}[y]$ --- i.e. averaging over people. If you have a method of making predictions $f(x)$ as a function of features $x$, then a standard performance metric is calibration --- which informally asks that for every prediction $p$, amongst all people for whom we predicted $f(x) = p$, the average of the realized labels $y$ should be $p$. Again, estimates of this form promise little to individuals, because they are averages over a large and heterogeneous population.
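Calibration in this sense is easy to audit empirically (a minimal sketch; bucketing by rounded prediction value is just one convenient convention):

```python
from collections import defaultdict

def calibration_error(preds, labels, decimals=1):
    # group examples by (rounded) predicted value, then compare the average
    # realized label in each group to the prediction for that group
    buckets = defaultdict(list)
    for p, y in zip(preds, labels):
        buckets[round(p, decimals)].append(y)
    n = len(preds)
    return sum(len(ys) / n * abs(sum(ys) / len(ys) - p)
               for p, ys in buckets.items())
```

A perfectly calibrated predictor drives this to zero even if it ignores individuals entirely, e.g. by always predicting the population average.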

Several years ago, Hebert-Johnson et al. proposed a nice way to interpolate between the (impossible) ideal of offering conditional mean predictions $f(x) = \mathbb{E}[y | x]$, and the weak guarantee of merely offering calibrated predictions $f$. Roughly speaking, they proposed to specify a very large collection of potentially intersecting groups $G$ (representing e.g. demographic groups like Sephardic Jewish women with a family history of diabetes, and hypertensive patients with low cholesterol and BMI values, etc.) and to ask that a trained predictor be simultaneously calibrated on each sufficiently large group in $G$. They showed how to accomplish this using a polynomially sized sample from the underlying distribution, with polynomial running time overhead on top of the cost of solving learning problems over $G$. 

In our paper, we --- roughly speaking --- show how to accomplish the same thing for variances and other higher moments, in addition to just means. And our "multicalibrated moment estimates" can be used to construct prediction intervals in exactly the same way that real moments of the conditional label distribution could be used. If you used the real (unknown) conditional label distribution moments, you would get conditional prediction intervals. If you use our multicalibrated moments, you get marginal prediction intervals that are simultaneously valid as averaged over each of the groups in $G$. So, for example, our hypertensive patient above could interpret her prediction interval --- if it was constructed from multicalibrated moment estimates computed from her features --- as an average over each of the demographic groups that she is a member of (so long as they are contained within $G$), and all of those interpretations would be simultaneously valid. 

I'll leave the details to the paper --- including what exactly we mean by "moment multicalibration". I'll just note that a major difficulty is that variances and higher moments --- unlike expectations --- do not combine linearly, so it is no longer sensible to ask that "amongst all people for whom we predicted variance $v$, the true variance should be $v$" --- because even the true conditional label variances do not satisfy this property. But it is sensible to ask that a pair of mean and moment predictions be calibrated in this way: "amongst all people for whom we predicted mean $\mu$ and variance $v$, the true mean should be $\mu$ and the true variance should be $v$." This is what we call "mean-conditioned moment calibration", and it is satisfied by the true distributional moments. 
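Concretely, this pair-calibration condition can be audited by bucketing on the predicted (mean, variance) pair jointly, rather than on either prediction alone (a sketch, with exact-match bucketing for simplicity):

```python
from collections import defaultdict

def mean_variance_calibration_gaps(mean_preds, var_preds, labels):
    # bucket people by their (predicted mean, predicted variance) pair and
    # compare the empirical mean and variance in each bucket to the predictions
    buckets = defaultdict(list)
    for m, v, y in zip(mean_preds, var_preds, labels):
        buckets[(m, v)].append(y)
    gaps = {}
    for (m, v), ys in buckets.items():
        emp_mean = sum(ys) / len(ys)
        emp_var = sum((y - emp_mean) ** 2 for y in ys) / len(ys)
        gaps[(m, v)] = (abs(emp_mean - m), abs(emp_var - v))
    return gaps
```

Because the empirical variance is computed around the bucket's own empirical mean, the condition is well-posed in a way that bucketing on variance predictions alone is not.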

Friday, June 05, 2020

TCS Visioning Workshop — Call for Participation

Reposting from here: https://thmatters.wordpress.com/2020/06/05/tcs-visioning-workshop-call-for-participation/

The CATCS will be hosting a virtual “Visioning Workshop” the week of July 20 in order to identify broad research themes within theoretical computer science (TCS) that have potential for a major impact in the future. The goals are similar to the workshop of the same name in 2008: to package these themes in a way that can be consumed by the general public, which we would deliver primarily to the Computing Community Consortium and others (e.g. funding agencies) to help them advocate for TCS.
While participation in the workshop is primarily through invitation, we have a few slots available for the broader community. If you are interested in participating, please see details of the application process below. The workshop will be organized according to area-wise breakout groups. Each breakout group will have 1-2 leads. Breakout groups will meet for 4-5 hours spread across several days and will be tasked with brainstorming ideas and preparing materials related to their topic. Leads are further expected to participate in plenary sessions held on Monday July 20 and Friday July 24 (4-5 hrs of additional time) where these materials will be discussed.
If you are interested in participating in the workshop, please fill out this Google form by Monday June 15. On this form, applicants are asked to contribute one or two major results in the last 10 years whose significance can be explained in layperson terms, and one or two major challenges for theory whose significance can be explained in layperson terms. These descriptions can be very brief. We will just use them to select participants and create breakout groups.

Saturday, May 16, 2020

FORC 2020 Program

The FORC 2020 program is now available here: https://responsiblecomputing.org/program/ (and reproduced below). Note the terrific set of contributed talks and keynotes, and note that registration is free! There will also be junior/senior breakout sessions in small groups during the lunch breaks to facilitate informal conversation and networking. Please consider attending, and help to spread the word!

FORC 2020 will take place virtually (on Zoom) on June 1st and 2nd, 2020. All times below are in Eastern Daylight Time (i.e. Boston/New York/Philadelphia time).

Registration is FREE but mandatory (Zoom links will be sent to registered email addresses). Register here before May 28: https://forms.gle/vgTzkMS6jLN9xPud8

Day 1: June 1, 2020. 
10:20-10:30 Opening Remarks
10:30-11:20 Keynote Talk: Adrian Weller, University of Cambridge (Session Chair: Suresh Venkatasubramanian)
11:20-12:00 Session 1 (Session Chair: Adam Smith)
  1. Aloni Cohen and Kobbi Nissim. Towards Formalizing the GDPR’s Notion of Singling Out
  2. Charlotte Peale, Omer Reingold and Katrina Ligett. Bounded-Leakage Differential Privacy
  3. Moni Naor and Neil Vexler. Can Two Walk Together: Privacy Enhancing Methods and Preventing Tracking of Users
  4. Badih Ghazi, Ravi Kumar, Pasin Manurangsi and Rasmus Pagh. Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead
  5. Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh and Ameya Velingker. On the Power of Multiple Anonymous Messages
12:00-1:30 Virtual Lunch/Informal Discussion/Networking
1:30-2:20 Keynote Talk: Rakesh Vohra, University of Pennsylvania (Session Chair: Rachel Cummings)
2:20-3:05 Session 2 (Session Chair: Nicole Immorlica)
  1. Lee Cohen, Zachary Lipton and Yishay Mansour. Efficient candidate screening under multiple tests and implications for fairness
  2. Cynthia Dwork, Christina Ilvento, Guy Rothblum and Pragya Sur. Abstracting Fairness: Oracles, Metrics, and Interpretability
  3. Christina Ilvento. Metric Learning for Individual Fairness
  4. Kevin Stangl and Avrim Blum. Recovering from Biased Data: Can Fairness Constraints Improve Accuracy?
End Day 1
Day 2: June 2, 2020.
10:30-11:20 Keynote Talk: Patricia Williams, Northeastern University (Session Chair: Cynthia Dwork)
11:20-12:00 Session 3 (Session Chair: Steven Wu)
  1. Arpita Biswas, Siddharth Barman, Amit Deshpande and Amit Sharma. Inframarginality Audit of Group-Fairness
  2. Christopher Jung, Sampath Kannan and Neil Lutz. Service in Your Neighborhood: Fairness in Center Location
  3. Cynthia Dwork, Christina Ilvento and Meena Jagadeesan. Individual Fairness in Pipelines
  4. Hao Wang, Hsiang Hsu, Mario Diaz and Flavio Calmon. To Split or Not to Split: The Impact of Disparate Treatment in Classification
12:00-1:30 Virtual Lunch/Informal Discussion/Networking
1:30-2:20 Keynote Talk: Jon Kleinberg, Cornell University. (Session Chair: Michael Kearns)
2:20-3:05 Session 4 (Session Chair: Guy Rothblum)
  1. Ashesh Rambachan and Jonathan Roth. Bias In, Bias Out? Evaluating the Folk Wisdom
  2. Mark Braverman and Sumegha Garg. The Role of Randomness and Noise in Strategic Classification
  3. Sergei Mikhalishchev and Andrei Matveenko. Attentional Role of Quota Implementation
  4. Roy Dong, Erik Miehling and Cedric Langbort. Protecting Consumers Against Personalized Pricing: A Stopping Time Approach
3:05-3:15 Closing Remarks.
End of Conference

Sunday, March 29, 2020

FORC 2020 Accepted Papers

When we announced the new conference FORC (Foundations of Responsible Computing) we really had no idea what kind of papers folks would send us.

Fortunately, we got a really high quality set of submissions, from which we have accepted the papers that will make up the program of the inaugural FORC. Check out the accepted papers here: https://responsiblecomputing.org/accepted-papers/

Monday, February 17, 2020

Fair Prediction with Endogenous Behavior

Can Game Theory Help Us Choose Among Fairness Constraints?

This blog post is about a new paper, joint with Christopher Jung, Sampath Kannan, Changhwa Lee, Mallesh M. Pai, and Rakesh Vohra.

A lot of the recent boom in interest in fairness in machine learning can be traced back to the 2016 ProPublica article Machine Bias. To summarize what you will already know if you have interacted with the algorithmic fairness literature at all --- ProPublica discovered that the COMPAS recidivism prediction instrument (used to inform bail and parole decisions by predicting whether individuals would go on to commit violent crimes if released) made errors of different sorts on different populations. The false positive rate (i.e. the rate at which it incorrectly labeled people "high risk") was much higher on the African American population than on the white population, and the false negative rate (i.e. the rate at which it incorrectly labeled people as "low risk") was much higher on the white population. Because being falsely labeled high risk is harmful (it decreases the chance you are released), this was widely and reasonably viewed as unfair.

But the story wasn't so simple. Northpointe, the company that produced COMPAS (they have since changed their name), responded by pointing out that their instrument satisfied predictive parity across the two populations --- i.e. that the positive predictive value of their instrument was roughly the same for both the white and African American populations. This means that their predictions conveyed the same meaning across the two populations: the people that COMPAS predicted were high risk had roughly the same chance of recidivating, on average, whether they were black or white. This is also desirable, because if we use an instrument that produces predictions whose meanings differ according to an individual's demographic group, then we are explicitly incentivizing judges to make decisions based on race after they are shown the prediction of the instrument. Of course, we now know that simultaneously equalizing false positive rates, false negative rates, and positive predictive values across populations is generically impossible --- i.e. it is impossible except under very special conditions, such as when the underlying crime rate is exactly the same in both populations. This follows from thinking about Bayes' rule.

Another sensible notion of fairness suggests that "similarly risky people should be treated similarly". This harkens back to notions of individual fairness, and suggests that we should do something like the following: we should gather as much information about an individual as we possibly can, and condition on all of it to find a (hopefully correct) posterior belief that they will go on to commit a crime. Then, we should make incarceration decisions by subjecting everyone to the same threshold on these posterior beliefs --- any individual who crosses some uniform threshold should be incarcerated; anyone who doesn't cross the threshold should not be. This is the approach that Corbett-Davies and Goel advocate for, and it seems to have a lot going for it. In addition to uniform thresholds feeling fair, it's also easy to see that doing this is the Bayes-optimal decision rule for optimizing any societal cost function that weights the costs of false positives and false negatives differently. But applying a uniform threshold to posterior beliefs will unfortunately, in general, result in a decision rule that equalizes neither false positive and false negative rates, nor positive predictive value. Similarly, satisfying these other notions of fairness will generally result in a decision rule that is sub-optimal in terms of its predictive performance.

Unfortunately, this leaves us with little guidance --- should we aim to equalize false positive and negative rates (sometimes called equalized odds in this literature)? Should we aim to equalize positive predictive value? Or should we aim for uniform thresholds on posterior beliefs? Should we aim for something else entirely? More importantly, by what means should we make these decisions?

A Game Theoretic Model

One way we can attempt to choose among different fairness "solution concepts" is to try and think about the larger societal effects that imposing a fairness constraint on a classifier will have. This is tricky, of course --- if we don't commit to some model of the world, then different fairness constraints can have either good or bad long term effects, which still doesn't give us much guidance. Of course making modeling assumptions has its own risks: inevitably the model won't match reality, and we should worry that the results we derive in our stylized model will not tell us anything useful about the real world. Nevertheless, it is worth trying to proceed: all models are wrong, but some are useful. Our goal will be to come up with a clean, simple model in which results are robust to modeling choices, and in which the necessary assumptions are clearly identified. Hopefully the result is some nugget of insight that applies outside of the model. This is what we try to do in our new paper with Chris Jung, Sampath Kannan, Changhwa Lee, Mallesh Pai, and Rakesh Vohra. We'll use the language of criminal justice here, but the model is simple enough that you could apply it to a number of other settings of interest in which we need to design binary classification rules. 

In our model, individuals make rational choices about whether or not to commit crimes: that is, individuals have some "outside option" (their opportunity for legal employment, for example), some expected monetary benefit of crime, and some dis-utility for being incarcerated. In deciding whether or not to commit a crime, an individual will weigh their expected benefit of committing a crime against that of taking their outside option --- and this calculation will involve their risk of being incarcerated if they commit a crime, and also if they do not (since inevitably any policy will occasionally free the guilty as well as incarcerate the innocent). Different people might make different decisions because their benefits and costs of crime may differ --- for example, some people will have better opportunities for legal employment than others. And in our model, the only way two different populations differ is in their distributions over these benefits and costs. Each person draws, i.i.d. from a distribution corresponding to their group, a type which encodes their outside option value and cost of incarceration. So in our model, populations differ only in, e.g., their access to legal employment opportunities, and this is what will underlie any difference in criminal base rates.

As a function of whether each person commits a crime or not, a "noisy signal" is generated. In general, think of higher signals as corresponding to increased evidence of guilt, and so if someone commits a crime, they will tend to draw higher signals than those who don't commit crimes --- but the signals are noisy, so there is no way to perfectly identify the guilty. 

Incarceration decisions are made as a function of these noisy signals: society has a choice as to what incarceration rule to choose, and can potentially choose a different rule for different groups. Once an incarceration rule is chosen, this determines each person's incentive to commit crime, which in turn fixes a base rate of crime in each population. In general, base rates will be different across different groups (because outside option distributions differ), so the impossibility of e.g. equalizing false positive rates, false negative rates, and positive predictive value across groups will hold in our setting. Since crime rates in our setting are a function of the incarceration rule we choose, there is a natural objective to consider: finding the policy that minimizes crime.
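To make the incentive structure concrete, here is a toy numerical version (everything specific here --- the Gaussian signal model, the parameter values --- is an illustrative assumption of mine, not the paper's): an individual offends exactly when the gain over their outside option exceeds the extra incarceration risk that offending induces, so the incarceration threshold pins down the crime rate:

```python
from math import erf, sqrt

def phi_bar(x):
    # P(N(0,1) > x), via the error function
    return 0.5 * (1 - erf(x / sqrt(2)))

def crime_rate(tau, outside_options, benefit=1.0, cost=5.0):
    # assumed signal model: N(1,1) for offenders, N(0,1) for non-offenders;
    # society incarcerates whenever the signal exceeds the threshold tau
    p_inc_crime = phi_bar(tau - 1.0)
    p_inc_clean = phi_bar(tau)
    # rational choice: offend iff the net benefit of crime beats the outside
    # option by more than the *extra* expected incarceration cost
    deterrence = cost * (p_inc_crime - p_inc_clean)
    opts = list(outside_options)
    return sum(1 for w in opts if benefit - w > deterrence) / len(opts)
```

A threshold in the informative range of the signal creates a large gap between the two incarceration probabilities and deters almost everyone; an extreme threshold, which incarcerates almost no one (guilty or not), deters almost no one.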

Let's think about how we might implement different fairness notions in this setting. First, how should we think about posterior probabilities that an individual will commit a crime? Before we see an individual's noisy signal, but after we see his group membership, we can form our prior belief that he has committed a crime --- this is just the base crime rate in his population. After we observe his noisy signal, we can use Bayes' rule to calculate a posterior probability that he has committed a crime. So we could apply the "uniform posterior threshold" approach to fairness and use an incarceration rule that incarcerates an individual exactly when their posterior probability of having committed a crime exceeds some uniform threshold. But note that because crime rates (and hence prior probabilities of crime) will generically differ between populations (because outside option distributions differ), setting the same threshold on the posterior probability of crime for both groups corresponds to setting different thresholds on the raw noisy signals. This makes sense --- a Bayesian doesn't need evidence as strong to convince her that someone from a high crime group has committed a crime as she needs to be convinced that someone from a low crime group has, because she started off with a higher prior belief about the person from the high crime group. This (as we already know) results in a classification rule that has different false positive rates and false negative rates across groups. 
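The Bayesian update in this paragraph is mechanical once we fix a signal model. A toy sketch (the Gaussian likelihoods --- N(1,1) for offenders, N(0,1) otherwise --- are an illustrative assumption, not the paper's): at any fixed signal, a higher group base rate yields a higher posterior, so a uniform posterior threshold is crossed at a lower raw signal in the higher-crime group:

```python
from math import exp

def posterior_crime(signal, base_rate):
    # Bayes' rule with assumed Gaussian signal likelihoods
    # (normalizing constants cancel in the ratio)
    lik_crime = exp(-((signal - 1.0) ** 2) / 2)
    lik_clean = exp(-(signal ** 2) / 2)
    num = base_rate * lik_crime
    return num / (num + (1 - base_rate) * lik_clean)
```

For example, `posterior_crime(1.0, 0.5)` exceeds a 50% posterior threshold while `posterior_crime(1.0, 0.2)` does not: the same evidence clears the bar for the higher-base-rate group only.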

On the other hand, if we want to equalize false positive and false negative rates across groups, we need an incarceration rule that sets the same threshold on raw noisy signals, independently of group. This will of course correspond to setting different thresholds on the posterior probability of crime (i.e. thresholding calibrated risk scores differently for different groups). And this will always be sub-optimal from the point of view of predicting crime --- the Bayes optimal predictor uniformly thresholds posterior probabilities. 

Which Notions of Fairness Lead to Desirable Outcomes?

But only one of these solutions is consistent with our social goal of minimizing crime. And it's not the Bayes optimal predictor. The crime-minimizing solution is the one that sets different thresholds on posterior probabilities (i.e. uniform thresholds on signals) so as to equalize false positive rates and false negative rates. In other words, to minimize crime, society should explicitly commit to not conditioning on group membership, even when group membership is statistically informative for the goal of predicting crime. 

Why? It's because although using demographic information is statistically informative for the goal of predicting crime when base rates differ, it is not something that is under the control of individuals --- they can control their own choices, but not what group they were born into. And making decisions about individuals using information that is not under their control has the effect of distorting their dis-incentive to commit crime --- it ends up providing less of a dis-incentive to individuals from the higher crime group (since they are more likely to be wrongly incarcerated even if they don't commit a crime). And because in our model people are rational actors, minimizing crime is all about managing incentives. 

This is our baseline model, and in the paper we introduce a number of extensions, generalizations, and elaborations on the model in order to stress-test it. The conclusions continue to hold in more elaborate and general settings, but at a high level, the key assumptions that are needed to reach them are that:
  1. The underlying base rates are rationally responsive to the decision rule used by society.
  2. Signals are observed at the same rates across populations, and
  3. The signals are conditionally independent of an individual’s group, conditioned on the individual’s decision about whether or not to commit crime.
Here, conditions (2) and (3) are unlikely to hold precisely in most situations, but we show that they can be relaxed in various ways while still preserving the core conclusion.

But more generally, if we are in a setting in which we believe that individual decisions are rationally made in response to the deployed classifier, and yet the deployed classifier does not equalize false positive and negative rates, then this is an indication that either the deployed classifier is sub-optimal (for the purpose of minimizing crime rates), or that one of conditions (2) and (3) fails to hold. Since in fairness-relevant settings the failure of conditions (2) and (3) is itself undesirable, this can serve as a diagnostic to highlight discriminatory conditions earlier in the pipeline than the final incarceration rule. In particular, if conditions (2) or (3) fail to hold, then imposing technical fairness constraints on a deployed classifier may be premature, and attention should instead be focused on structural differences in the observations that are being fed into the deployed classifier.