tag:blogger.com,1999:blog-255627052020-11-29T19:21:42.157-05:00Adventures in ComputationAaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.comBlogger98125tag:blogger.com,1999:blog-25562705.post-79586925471562883912020-09-01T10:16:00.002-04:002020-09-02T13:39:14.477-04:00No Regret Algorithms from the Min Max Theorem<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>The existence of no-regret learning algorithms can be used to prove <a href="https://en.wikipedia.org/wiki/Minimax_theorem">von Neumann's min-max theorem</a>. This argument is originally due to <a href="https://www.cs.princeton.edu/~schapire/papers/FreundScYY.pdf">Freund and Schapire</a>, and I <a href="https://www.cis.upenn.edu/~aaroth/courses/slides/agt20/lect08.pdf">teach it to my undergraduates </a>in my algorithmic game theory class. The min-max theorem can also be used to prove the existence of no-regret learning algorithms. Here is a constructive version of the argument (constructive in that the resulting algorithm only requires solving polynomially sized zero-sum games, which can be done via linear programming).<div><br /></div><div>Recall the setting. Play proceeds in rounds $t \in \{1,\ldots,T\}$. On each day $t$, the learner chooses one of $k$ actions $i_t \in \{1,\ldots,k\}$, and the adversary chooses a loss vector $\ell^t \in [0,1]^k$. The learner incurs loss $\ell^t_{i_t}$, corresponding to the action she chose. 
At the end of the interaction, the regret of the learner is defined to be the difference between the cumulative loss she incurred and the cumulative loss of the best fixed action (played consistently) in hindsight:</div><div>$$\textrm{Regret}_T = \max_j \sum_{t=1}^T \left(\ell^t_{i_t} - \ell^t_j\right)$$ </div><div>A classical and remarkable result is that there exist algorithms that can guarantee that regret grows only sublinearly with time: $\textrm{Regret}_T = O(\sqrt{T})$. Let's prove this. </div><div><br /></div><div><br /></div><div>Define the non-negative portion of our cumulative regret with respect to action $j$ up until day $d$ as:</div><div>$$V_d^j = \left(\sum_{t=1}^d\left(\ell^t_{i_t} - \ell^t_j\right)\right)^+$$</div><div>and our additional regret on day $d+1$ with respect to action $j$ as:</div><div>$$r_{j}^{d+1} = \ell_{i_{d+1}}^{d+1} - \ell^{d+1}_j$$</div><div>Observe that if $V_{d}^j \geq 1$ then $V_{d+1}^j = V_d^j + r_j^{d+1}$. </div><div><br /></div><div><br /></div><div>Define a surrogate loss function as our squared cumulative regrets, summed over all actions: </div><div>$$L_d = \sum_{j=1}^k (V_d^j)^2$$</div><div>Observe that we can bound the expected increase in our surrogate loss on day $d+1$, conditioned on the history thus far:</div><div>$$\mathbb{E}[L_{d+1} - L_d] \leq \sum_{j : V_d^j \geq 1} \mathbb{E}[(V_d^j+r_j^{d+1})^2 - (V_d^j)^2] + 3k$$</div><div>$$= \sum_{j : V_d^j \geq 1} \left(2V_d^j \mathbb{E}[r_{j}^{d+1}] + \mathbb{E}[(r_{j}^{d+1})^2]\right) + 3k $$</div><div>$$\leq \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) + 4k$$</div><div>where the expectations are taken over the randomness of both the learner and the adversary in round $d+1$. The additive $3k$ accounts for the actions with $V_d^j < 1$: for these, $V_{d+1}^j \leq (V_d^j + r_j^{d+1})^+$, so $(V_{d+1}^j)^2 - (V_d^j)^2 \leq 2V_d^j r_j^{d+1} + (r_j^{d+1})^2 \leq 3$, since $0 \leq V_d^j < 1$ and $|r_j^{d+1}| \leq 1$. 
</div><div><br /></div><div>Now consider a zero-sum game played between the learner and the adversary in which the learner is the minimization player, the adversary is the maximization player, and the utility function is $$u(i_{d+1}, \ell^{d+1}) = \sum_{j=1}^k 2V_d^j r_{j}^{d+1}$$ The min-max theorem says that the learner can guarantee herself the same payoff in the following two scenarios:</div><div><br /></div><div><ol style="text-align: left;"><li>The learner first has to commit to playing a distribution $p_{d+1}$ over actions $i$, and then the adversary gets to best respond by picking the worst possible loss vectors, or</li><li>The adversary has to first commit to a distribution over loss vectors $\ell$ and then the learner gets the benefit of picking the best action $i_{d+1}$ to respond with. </li></ol>Scenario 1) is the scenario our learner finds herself in, when playing against an adaptive adversary. But scenario 2) is much easier to analyze. If the adversary first commits to a distribution over loss vectors $\ell^{d+1}$, the learner can always choose action $i_{d+1} = \arg\min_j \mathbb{E}[\ell^{d+1}_j]$, which guarantees that $\mathbb{E}[r_{j}^{d+1}] \leq 0$ for every $j$, which in turn guarantees that the value of the game $ \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) \leq 0$. Hence, the min-max theorem tells us that the learner always has a distribution over actions $p_{d+1}$ that guarantees that $\mathbb{E}[L_{d+1} - L_d] \leq 4k$, <i>even in the worst case over loss functions</i>. 
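As an aside, for this particular game the learner does not even need to solve a linear program: one can check that playing each action $i$ with probability proportional to $V_d^i$ makes $\sum_{j=1}^k 2V_d^j \mathbb{E}[r_j^{d+1}]$ exactly zero for every loss vector the adversary might pick (this recovers Hart and Mas-Colell's "regret matching" rule). A minimal sketch in Python; the i.i.d. uniform adversary at the bottom is a toy stand-in, invented purely for illustration:

```python
import random

def regret_matching_distribution(S):
    # Play each action with probability proportional to the positive part of
    # its cumulative regret; fall back to uniform if no regret is positive yet.
    pos = [max(s, 0.0) for s in S]
    total = sum(pos)
    if total == 0.0:
        return [1.0 / len(S)] * len(S)
    return [p / total for p in pos]

def play(T, k, loss_fn, seed=0):
    rng = random.Random(seed)
    S = [0.0] * k                # S[j] = sum_t (learner's loss - action j's loss)
    learner_loss = 0.0
    action_loss = [0.0] * k
    for t in range(T):
        p = regret_matching_distribution(S)
        i = rng.choices(range(k), weights=p)[0]
        ell = loss_fn(t)         # adversary's loss vector in [0,1]^k
        learner_loss += ell[i]
        for j in range(k):
            action_loss[j] += ell[j]
            S[j] += ell[i] - ell[j]
    return learner_loss - min(action_loss)   # realized regret

# Toy adversary: i.i.d. uniform losses (illustrative only, not worst-case).
adv = random.Random(1)
T, k = 20000, 3
regret = play(T, k, lambda t: [adv.random() for _ in range(k)])
print(regret / T)   # average regret; should be small, on the order of sqrt(k/T)
```

On a run like this the printed average regret should come out well below the $2\sqrt{kT}/T$ guarantee from the argument above.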
If the learner always plays according to this distribution, then by a telescoping sum, we have that:</div><div>$$\mathbb{E}[L_T] \leq 4kT.$$</div><div>We therefore have by Jensen's inequality that:</div><div>$$\mathbb{E}[\max_j V^j_T] \leq \sqrt{\mathbb{E}[\max_j (V^j_T)^2]} \leq \sqrt{\mathbb{E}\left[\sum_{j=1}^k (V^j_T)^2\right]} = \sqrt{\mathbb{E}[L_T]} \leq 2\sqrt{kT}.$$</div><div>Since $\max_j V^j_T$ is exactly the non-negative portion of $\textrm{Regret}_T$, this gives $\mathbb{E}[\textrm{Regret}_T] \leq 2\sqrt{kT}$, as promised.</div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-412575425420661992020-08-19T07:47:00.001-04:002020-09-30T19:34:40.096-04:00Moment Multicalibration for Uncertainty Estimation<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>This blog post is about <a href="https://arxiv.org/abs/2008.08037" target="_blank">a new paper</a> that I'm excited about, which is joint work with <a href="https://www.cis.upenn.edu/~chrjung/" target="_blank">Chris Jung</a>, <a href="https://economics.sas.upenn.edu/people/changhwa-lee" target="_blank">Changhwa Lee</a>, <a href="https://sites.google.com/view/malleshpai/">Mallesh Pai</a>, and <a href="https://sites.google.com/site/quaerereverum9/">Ricky Vohra</a>. If you prefer watching talks, you can watch one I gave to the Wharton statistics department <a href="https://youtu.be/PfQ9vfAKkFM">here</a>.<div><br /></div><div>Suppose you are diagnosed with hypertension, and your doctor recommends that you take a certain drug to lower your blood pressure. The latest research, she tells you, finds that the drug lowers diastolic blood pressure by an average of 10 mm Hg. You remember your statistics class from college, and so you ask about confidence intervals. She looks up the paper, and tells you that it reports a 95% confidence interval of [5, 15]. How should you interpret this? 
</div><div><br /></div><div>What you might naively hope is that [5, 15] represents a <i>conditional prediction interval</i>. If you have some set of observable features $x$, and a label $y$ (in this case corresponding to your decrease in diastolic blood pressure after taking the drug), a 95% conditional prediction interval would promise that:</div><div>$$\Pr_y [y \in [5, 15] | x] \geq 0.95$$</div><div><br /></div><div>In other words, a conditional prediction interval would promise that given all of your observed features, <i>over the unrealized/unmeasured randomness of the world</i>, there is a 95% chance that your diastolic blood pressure will decrease by between 5 and 15 points. </div><div><br /></div><div>But if you think about it, coming up with a conditional prediction interval is essentially impossible in a rich feature space. If $x$ contains lots of information about you, then probably there was nobody in the original study population that exactly matched your set of features $x$, and so we have no information at all about the conditional distribution on $y$ given $x$ --- i.e. no samples at all from the distribution over which our coverage probability supposedly holds! So how can you expect any sort of promise at all? There are two typical ways around this difficulty. </div><div><br /></div><div>The first is to make heroic assumptions about the data generation process. For example, if we assume that the world looks like an ordinary least squares model, and that there is a linear relationship between $y$ and $x$, then we can form a confidence region around the parameters of the model, and from that derive prediction intervals. But these prediction intervals are not valid if the model fails to hold, which it inevitably will. </div><div><br /></div><div>The second is to give up on conditional prediction intervals, and instead give <i>marginal prediction intervals</i>. 
This is what the <a href="https://arxiv.org/abs/0706.3188" target="_blank">conformal prediction</a> literature aims to do. A marginal prediction interval looks quite similar to a conditional prediction interval (at least syntactically), and promises:</div><div>$$\Pr_{(x,y)} [y \in [5, 15] ] \geq 0.95$$</div><div><br /></div><div>Rather than conditioning on your features $x$, a marginal prediction interval averages over all people, and promises that 95% of people who take the drug have their diastolic blood pressure lowered by between 5 and 15 points. But the semantics of this promise are quite different from those of a conditional prediction interval. Because the average is now taken over a large, heterogeneous population, very little is promised to <i>you</i>. For example, it might be that for patients in your demographic group (e.g. middle-aged women with Sephardic Jewish ancestry and a family history of diabetes) the drug is actually expected to raise blood pressure rather than lower it. Because this subgroup represents less than 5% of the population, it is entirely consistent with the marginal prediction interval being correct. Of course, if you are lucky, then perhaps someone has conducted a study of people from this demographic group and has computed marginal prediction intervals over it! But what if there are multiple different groups that you are a member of, over which the results seem to conflict? For example, you might also have a low BMI value and have unusually good cholesterol readings --- features of a group for which the drug works unusually well. Which uncertainty estimate should you trust, if you are a member of both groups? </div><div><br /></div><div>These concerns actually arise already when we think about the semantics of mean estimates ("the expected drop in blood pressure amongst patients who take this drug is 10 mm Hg"). Ideally, if you were a patient with features $x$, then 10 would be an estimate of $\mathbb{E}[y | x]$. 
But just as with uncertainty estimation, in a large feature space, we typically have no information about the distribution on $y$ conditional on $x$ (because we have never met anyone exactly like <i>you</i> before), and so instead what we have is just an estimate of $\mathbb{E}[y]$ --- i.e. averaging over people. If you have a method of making predictions $f(x)$ as a function of features $x$, then a standard performance metric is <i>calibration</i> --- which informally asks that for every prediction $p$, amongst all people for whom we predicted $f(x) = p$, the average of the realized labels $y$ should be $p$. Again, estimates of this form promise little to individuals, because they are averages over a large and heterogeneous population. </div><div><br /></div><div>Several years ago, <a href="https://arxiv.org/abs/1711.08513" target="_blank">Hebert-Johnson et al.</a> proposed a nice way to interpolate between the (impossible) ideal of offering conditional mean predictions $f(x) = \mathbb{E}[y | x]$, and the weak guarantee of merely offering calibrated predictions $f$. Roughly speaking, they proposed to specify a very large collection of potentially intersecting groups $G$ (representing e.g. demographic groups like Sephardic Jewish women with a family history of diabetes, and hypertensive patients with low cholesterol and BMI values, etc.) and to ask that a trained predictor be <i>simultaneously</i> calibrated on each sufficiently large group in $G$. They showed how to accomplish this using a polynomially sized sample from the underlying distribution, with polynomial running time overhead, on top of the cost of solving learning problems over $G$. </div><div><br /></div><div>In our paper, we --- roughly speaking --- show how to accomplish the same thing, but for variances and other higher moments, in addition to just means. 
And our "multicalibrated moment estimates" can be used to construct prediction intervals in exactly the same way that real moments of the conditional label distribution could be used. If you used the real (unknown) label distribution moments, you would have gotten conditional prediction intervals. If you use our multicalibrated moments, you get marginal prediction intervals that are simultaneously valid as averaged over each of the groups in $G$. So, for example, our hypertensive patient above could interpret her prediction interval --- if it was constructed from multicalibrated moment estimates computed from her features --- as an average over each of the demographic groups that she is a member of (so long as they are contained within $G$), and all of those interpretations would be simultaneously valid. </div><div><br /></div><div>I'll leave the details to the paper --- including what exactly we mean by "moment multicalibration". I'll just note that a major difficulty is that variances and higher moments --- unlike expectations --- do not combine linearly, so it is no longer sensible to ask that "amongst all people for whom we predicted variance $v$, the true variance should be $v$" --- because even the true conditional label variances do not satisfy this property. But it <i>is </i>sensible to ask that a pair of mean and moment predictions be calibrated in this way: "amongst all people for whom we predicted mean $\mu$ and variance $v$, the true mean should be $\mu$ and the true variance should be $v$." This is what we call "mean-conditioned moment calibration", and it is satisfied by the true distributional moments. 
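As a toy illustration of what this kind of condition asks for, the sketch below audits mean-and-variance calibration, within groups, on synthetic data. Everything here is invented for the example (two overlapping groups, and predictions that happen to equal the true conditional moments); it is an empirical check of the property, not the algorithm from the paper:

```python
import random
from collections import defaultdict

def worst_calibration_gap(samples, min_cell=100):
    """samples: list of (group_ids, predicted_mean, predicted_variance, outcome).

    Buckets people by group and by their (mean, variance) prediction pair, and
    within each sufficiently large cell compares the empirical mean and
    variance of the outcomes to the predictions. Returns the largest gap seen.
    """
    cells = defaultdict(list)
    for groups, mu, var, y in samples:
        for g in groups:
            cells[(g, mu, var)].append(y)
    worst = 0.0
    for (g, mu, var), ys in cells.items():
        if len(ys) < min_cell:       # mirror the "sufficiently large group" caveat
            continue
        m = sum(ys) / len(ys)
        v = sum((y - m) ** 2 for y in ys) / len(ys)
        worst = max(worst, abs(m - mu), abs(v - var))
    return worst

# Two overlapping demographic groups; the predictions below equal the *true*
# conditional moments, so every gap should shrink as the sample grows.
rng = random.Random(0)
samples = []
for _ in range(20000):
    in_g0, in_g1 = rng.random() < 0.5, rng.random() < 0.5
    mu = 1.0 if in_g0 else 0.0                 # true conditional mean
    y = mu + rng.uniform(-1.0, 1.0)            # true conditional variance = 1/3
    groups = tuple(g for g, flag in (("g0", in_g0), ("g1", in_g1)) if flag)
    samples.append((groups, mu, 1.0 / 3.0, y))
print(worst_calibration_gap(samples))          # small, shrinking with sample size
```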
</div><div><br /></div><div>The paper is here: <a href="https://arxiv.org/abs/2008.08037">Moment Multicalibration for Uncertainty Estimation</a>.</div><div><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-41135314329461121222020-06-05T18:59:00.002-04:002020-06-05T18:59:33.143-04:00TCS Visioning Workshop — Call for ParticipationReposting from here: <a href="https://thmatters.wordpress.com/2020/06/05/tcs-visioning-workshop-call-for-participation/">https://thmatters.wordpress.com/2020/06/05/tcs-visioning-workshop-call-for-participation/</a><br /><br /><div style="background-color: white; color: #555555; font-family: "Droid Serif", arial, serif; font-size: 14px;"><div style="line-height: 22px; padding: 0px 0px 20px;"><span class="im">The <a href="https://thmatters.wordpress.com/catcs/" style="color: #3f9291;">CATCS</a> will be hosting a virtual <strong>“Visioning Workshop”</strong> the <strong>week of </strong></span><strong>July 20</strong> in order to identify broad research themes within theoretical computer science (TCS) that have potential for a major impact in the future. The goals are similar to the <a href="https://thmatters.wordpress.com/visioning-workshop/" style="color: #3f9291;">workshop of the same name</a> in 2008: to package these themes in a way that can be consumed <span class="im">by the general public, which we would deliver primarily to the Computing Community Consortium and others (e.g. funding agencies) to help them advocate for TCS.</span></div><div style="line-height: 22px; padding: 0px 0px 20px;">While participation in the workshop is primarily through invitation, we have a few slots available for the broader community. If you are interested in participating, please see details of the application process below. The workshop will be organized according to area-wise breakout groups. Each breakout group will have 1-2 leads. 
Breakout groups will meet for 4-5 hours spread across several days and will be tasked with brainstorming ideas and preparing materials related to their topic. Leads are further expected to participate in plenary sessions held on Monday July 20 and Friday July 24 (4-5 hrs of additional time) where these materials will be discussed.</div><div style="line-height: 22px; padding: 0px 0px 20px;">If you are interested in participating in the workshop, please fill out <a href="https://forms.gle/cdCTsLfUs56CDhKS9" style="color: #3f9291;">this Google form</a> by <strong>Monday June 15</strong>. On this form, applicants are asked to contribute one or two <span class="im">major results in the last 10 years whose significance can be explained in layperson terms, and one or two major challenges for theory whose significance can be explained in layperson terms. These descriptions </span>can be very brief. We will just use them to select participants and create breakout groups.</div></div><div style="background-color: white; color: #555555; font-family: "Droid Serif", arial, serif; font-size: 14px;"></div><div style="background-color: white; color: #555555; font-family: "Droid Serif", arial, serif; font-size: 14px;"><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-46591912080654234672020-05-16T14:27:00.002-04:002020-05-16T14:27:27.214-04:00FORC 2020 ProgramThe FORC 2020 program is now available here: <a href="https://responsiblecomputing.org/program/">https://responsiblecomputing.org/program/</a> (and reproduced below). Note the <i>terrific</i> set of contributed talks and keynotes, and note that registration is <i>free</i>! There will also be junior/senior breakout sessions into small groups during the lunch breaks to facilitate informal conversation and networking. 
Please consider attending, and help to spread the word!<br /><br /><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">FORC 2020 will take place virtually (on Zoom) on June 1st and 2nd, 2020. All times below are in Eastern Daylight Time (i.e. Boston/New York/Philadelphia time).</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;"><br /></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">Registration is FREE but mandatory (Zoom links will be sent to registered email addresses). Register here before May 28: <a href="https://forms.gle/vgTzkMS6jLN9xPud8" style="color: #99cc33; text-decoration-line: none; transition: background 0.25s ease 0s, color 0.25s ease 0s;">https://forms.gle/vgTzkMS6jLN9xPud8</a></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;"><br /></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px; text-align: center;"><span style="font-weight: 700;">Day 1: June 1, 2020. 
</span></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">10:20-10:30 Opening Remarks</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">10:30-11:20 <span style="font-weight: 700;">Keynote Talk</span>: <span style="font-weight: 700;">Adrian Weller</span>, University of Cambridge (Session Chair: <em>Suresh Venkatasubramanian</em>)</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">11:20-12:00 Session 1 (Session Chair: <em>Adam Smith)</em></div><ol style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; list-style: none; margin: 24px 0px; padding: 0px 0px 0px 12px;"><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Aloni Cohen and Kobbi Nissim. <span style="font-weight: 700;">Towards Formalizing the GDPR’s Notion of Singling Out</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Charlotte Peale, Omer Reingold and Katrina Ligett. <span style="font-weight: 700;">Bounded-Leakage Differential Privacy</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Moni Naor and Neil Vexler.<span style="font-weight: 700;"> Can Two Walk Together: Privacy Enhancing Methods and Preventing Tracking of Users</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Badih Ghazi, Ravi Kumar, Pasin Manurangsi and Rasmus Pagh. 
<span style="font-weight: 700;">Private Counting from Anonymous Messages: Near-Optimal Accuracy with Vanishing Communication Overhead</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh and Ameya Velingker. <span style="font-weight: 700;">On the Power of Multiple Anonymous Messages</span></li></ol><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">12:00-1:30 Virtual Lunch/Informal Discussion/Networking</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">1:30-2:20 <span style="font-weight: 700;">Keynote Talk: Rakesh Vohra</span>, University of Pennsylvania (Session Chair: <em>Rachel Cummings</em>)</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">2:20-3:05 Session 2 (Session Chair: <em>Nicole Immorlica</em>)</div><ol style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; list-style: none; margin: 24px 0px; padding: 0px 0px 0px 12px;"><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Lee Cohen, Zachary Lipton and Yishay Mansour. 
<span style="font-weight: 700;">Efficient candidate screening under multiple tests and implications for fairness</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Cynthia Dwork, Christina Ilvento, Guy Rothblum and Pragya Sur.<span style="font-weight: 700;"> Abstracting Fairness: Oracles, Metrics, and Interpretability</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Christina Ilvento.<span style="font-weight: 700;"> Metric Learning for Individual Fairness</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Kevin Stangl and Avrim Blum.<span style="font-weight: 700;"> Recovering from Biased Data: Can Fairness Constraints Improve Accuracy?</span></li></ol><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px; text-align: center;"><span style="font-weight: 700;">End Day 1</span></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px; text-align: center;"><span style="font-weight: 700;">Day 2: June 2, 2020.</span></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">10:30-11:20 <span style="font-weight: 700;">Keynote Talk</span>: <span style="font-weight: 700;">Patricia Williams, </span>Northeastern University (Session Chair: <em>Cynthia Dwork</em>)</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">11:20-12:00 Session 3 (Session Chair: <em>Steven Wu)</em></div><ol style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; list-style: none; margin: 24px 0px; padding: 0px 0px 0px 12px;"><li style="list-style: inside 
decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Arpita Biswas, Siddharth Barman, Amit Deshpande and Amit Sharma. <span style="font-weight: 700;">Inframarginality Audit of Group-Fairness</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Christopher Jung, Sampath Kannan and Neil Lutz. <span style="font-weight: 700;">Service in Your Neighborhood: Fairness in Center Location</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Cynthia Dwork, Christina Ilvento and Meena Jagadeesan. <span style="font-weight: 700;">Individual Fairness in Pipelines</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Hao Wang, Hsiang Hsu, Mario Diaz and Flavio Calmon. <span style="font-weight: 700;">To Split or Not to Split: The Impact of Disparate Treatment in Classification</span></li></ol><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">12:00-1:30 Virtual Lunch/Informal Discussion/Networking</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">1:30-2:20 <span style="font-weight: 700;">Keynote Talk: Jon Kleinberg</span>, Cornell University. (Session Chair: <em>Michael Kearns</em><span style="font-weight: 700;">)</span></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">2:20-3:05 Session 4 (Session Chair: <em>Guy Rothblum)</em></div><ol style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; list-style: none; margin: 24px 0px; padding: 0px 0px 0px 12px;"><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Ashesh Rambachan and Jonathan Roth. 
<span style="font-weight: 700;">Bias In, Bias Out? Evaluating the Folk Wisdom</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Mark Braverman and Sumegha Garg. <span style="font-weight: 700;">The Role of Randomness and Noise in Strategic Classification</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Sergei Mikhalishchev and Andrei Matveenko. <span style="font-weight: 700;">Attentional Role of Quota Implementation</span></li><li style="list-style: inside decimal; margin: 6px 0px; padding: 0px 0px 0px 12px;">Roy Dong, Erik Miehling and Cedric Langbort. <span style="font-weight: 700;">Protecting Consumers Against Personalized Pricing: A Stopping Time Approach</span></li></ol><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px;">3:05-3:15 Closing Remarks.</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; margin-bottom: 24px; margin-top: 24px; padding: 0px; text-align: center;"><span style="font-weight: 700;">End of Conference</span></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-46863870492697403602020-03-29T20:38:00.004-04:002020-03-29T20:38:47.675-04:00FORC 2020 
Accepted PapersWhen we <a href="https://aaronsadventures.blogspot.com/2019/10/forc-new-conference-you-should-know.html">announced </a>the new conference <a href="https://responsiblecomputing.org/accepted-papers/">FORC (Foundations of Responsible Computing)</a>, we really had no idea what kind of papers folks would send us.<br /><br />Fortunately, we got a really high-quality set of submissions, from which we have accepted the papers that will make up the program of the inaugural FORC. Check out the accepted papers here: <a href="https://responsiblecomputing.org/accepted-papers/">https://responsiblecomputing.org/accepted-papers/</a>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-36832508142332323752020-02-17T11:08:00.000-05:002020-02-17T11:54:18.501-05:00Fair Prediction with Endogenous Behavior<h2 style="text-align: center;">Can Game Theory Help Us Choose Among Fairness Constraints?</h2><br /><div style="text-align: center;"><i>This blog post is about a <a href="https://www.cis.upenn.edu/~aaroth/Papers/endogenous.pdf">new paper</a>, joint with Christopher Jung, Sampath Kannan, Changhwa Lee, Mallesh M. Pai, and Rakesh Vohra.</i></div><br /><br />A lot of the recent boom in interest in fairness in machine learning can be traced back to the 2016 ProPublica article <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine Bias</a>. To summarize what you will already know if you have interacted with the algorithmic fairness literature at all --- ProPublica discovered that the COMPAS recidivism prediction instrument (used to inform bail and parole decisions by predicting whether individuals would go on to commit violent crimes if released) made errors of different sorts on different populations. The false positive rate (i.e. 
the rate at which it incorrectly labeled people "high risk") was much higher on the African American population than on the white population, and the false negative rate (i.e. the rate at which it incorrectly labeled people as "low risk") was much higher on the white population. Because being falsely labeled high risk is harmful (it decreases the chance you are released), this was widely and reasonably viewed as unfair.<br /><br />But the story wasn't so simple. Northpointe, the company that produced COMPAS (They have since changed their name) <a href="https://go.volarisgroup.com/rs/430-MBX-989/images/ProPublica_Commentary_Final_070616.pdf">responded </a>by pointing out that their instrument satisfied predictive parity across the two populations --- i.e. that the <i>positive predictive value </i>of their instrument was roughly the same for both white and African American populations. This means that their predictions conveyed the same meaning across the two populations: the people that COMPAS predicted were high risk had roughly the same chance of recidivating, on average, whether or not they were black or white. This is also desirable, because if we use an instrument that produces predictions whose meanings differ according to an individual's demographic group, then we are explicitly incentivizing judges to make decisions based on race, after they are shown the prediction of the instrument. Of course, we now know that simultaneously equalizing false positive rates, false negative rates, and positive predictive values across populations is <a href="https://arxiv.org/abs/1610.07524">generically impossible</a> --- i.e. it is impossible except under very special conditions, such as when the underlying crime rate is exactly the same in both populations. 
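To keep the three quantities in this debate straight, here is a small self-contained sketch that computes false positive rates, false negative rates, and positive predictive values per group. The counts are invented for illustration, chosen so that (as in the COMPAS discussion) both groups see the same positive predictive value while their error rates differ sharply:

```python
from collections import defaultdict

def group_rates(records):
    """records: iterable of (group, y_true, y_pred) with 0/1 labels.

    Returns {group: (false positive rate, false negative rate,
                     positive predictive value)}.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])      # TP, FP, FN, TN per group
    for g, y, yhat in records:
        if yhat and y:
            counts[g][0] += 1
        elif yhat:
            counts[g][1] += 1
        elif y:
            counts[g][2] += 1
        else:
            counts[g][3] += 1
    return {g: (fp / (fp + tn), fn / (fn + tp), tp / (tp + fp))
            for g, (tp, fp, fn, tn) in counts.items()}

# Invented counts: both groups have PPV = 2/3, but group A's false positive
# rate is about 23% while group B's is about 5%, and the false negative
# rates differ as well.
records = ([("A", 1, 1)] * 60 + [("A", 0, 1)] * 30
         + [("A", 1, 0)] * 10 + [("A", 0, 0)] * 100
         + [("B", 1, 1)] * 20 + [("B", 0, 1)] * 10
         + [("B", 1, 0)] * 30 + [("B", 0, 0)] * 200)
rates = group_rates(records)
print(rates)
```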
This <a href="http://aaronsadventures.blogspot.com/2019/02/impossibility-results-in-fairness-as.html">follows from thinking about Bayes Rule</a>.<br /><br />Another sensible notion of fairness suggests that "similarly risky people should be treated similarly". This harkens back to notions of <a href="https://arxiv.org/abs/1104.3913">individual fairness</a>, and suggests that we should do something like the following: we should gather as much information about an individual as we possibly can, and condition on all of it to find a (hopefully correct) posterior belief that they will go on to commit a crime. Then, we should make incarceration decisions by subjecting everyone to the same threshold on these posterior beliefs --- any individual who crosses some uniform threshold should be incarcerated; anyone who doesn't cross the threshold should not be. This is the approach that <a href="https://arxiv.org/abs/1808.00023">Corbett-Davies and Goel </a>advocate for, and it seems to have a lot going for it. In addition to uniform thresholds feeling fair, it's also easy to see that doing this is the Bayes-optimal decision rule to optimize any societal cost function that differently weights the cost of false positives and false negatives. But applying a uniform threshold on posterior distributions unfortunately will generally result in a decision rule that neither equalizes false positive and false negative rates, nor positive predictive value. Similarly, satisfying these other notions of fairness will generally result in a decision rule that is sub-optimal in terms of its predictive performance.<br /><br />Unfortunately, this leaves us with little guidance --- should we aim to equalize false positive and negative rates (sometimes called <i><a href="https://arxiv.org/abs/1610.02413">equalized odds</a></i> in this literature)? Should we aim to equalize positive predictive value? Or should we aim for using uniform thresholds on posterior beliefs? 
Should we aim for something else entirely? More importantly, by what means should we aim to make these decisions?<br /><br /><h2>A Game Theoretic Model</h2><div>One way we can attempt to choose among different fairness "solution concepts" is to try and think about the larger societal effects that imposing a fairness constraint on a classifier will have. This is tricky, of course --- if we don't commit to some model of the world, then different fairness constraints can have either <a href="https://arxiv.org/abs/1803.04383">good or bad long term effects</a>, which still doesn't give us much guidance. Of course making modeling assumptions has its own risks: inevitably the model won't match reality, and we should worry that the results that we derive in our stylized model will not tell us anything useful about the real world. Nevertheless, it is worth trying to proceed: all models are wrong, but some are useful. Our goal will be to come up with a clean, simple model, in which results are robust to modeling choices, and the necessary assumptions are clearly identified. Hopefully the result is some nugget of insight that applies outside of the model. This is what we try to do in our <a href="https://www.cis.upenn.edu/~aaroth/Papers/endogenous.pdf">new paper</a> with Chris Jung, Sampath Kannan, Changhwa Lee, Mallesh Pai, and Rakesh Vohra. We'll use the language of criminal justice here, but the model is simple enough that you could apply it to a number of other settings of interest in which we need to design binary classification rules. </div><div><br /></div><div>In our model, individuals make rational choices about whether or not to commit crimes: that is, individuals have some "outside option" (their opportunity for legal employment, for example), some expected monetary benefit of crime, and some dis-utility for being incarcerated. 
In deciding whether or not to commit a crime, an individual will weigh their expected benefit of committing a crime, compared to taking their outside option --- and this calculation will involve their risk of being incarcerated if they commit a crime, and also if they do not (since inevitably any policy will both occasionally free the guilty as well as incarcerate the innocent). Different people might make different decisions because their benefits and costs of crime may differ --- for example, some people will have better opportunities for legal employment than others. And in our model, the only way two different populations differ is in their distributions of these benefits and costs. Each person draws, i.i.d. from a distribution corresponding to their group, a type which encodes this outside option value and cost for incarceration. So in our model, populations differ e.g. only in their access to legal employment opportunities, and this is what will underlie any difference in criminal base rates. </div><div><br /></div><div>As a function of whether each person commits a crime or not, a "noisy signal" is generated. In general, think of higher signals as corresponding to increased evidence of guilt, and so if someone commits a crime, they will tend to draw higher signals than those who don't commit crimes --- but the signals are noisy, so there is no way to perfectly identify the guilty. </div><div><br /></div><div>Incarceration decisions are made as a function of these noisy signals: society has a choice as to what incarceration rule to choose, and can potentially choose a different rule for different groups. Once an incarceration rule is chosen, this determines each person's incentive to commit crime, which in turn fixes a base rate of crime in each population. In general, base rates will be different across different groups (because outside option distributions differ), so the impossibility of e.g. 
equalizing false positive rates, false negative rates, and positive predictive value across groups will hold in our setting. Since crime rates in our setting are a function of the incarceration rule we choose, there is a natural objective to consider: finding the policy that <i>minimizes crime</i>. </div><div><br /></div><div>Let's think about how we might implement different fairness notions in this setting. First, how should we think about posterior probabilities that an individual will commit a crime? Before we see an individual's noisy signal, but after we see his group membership, we can form our <i>prior </i>belief that he has committed a crime --- this is just the base crime rate in his population. After we observe his noisy signal, we can use Bayes rule to calculate a posterior probability that he has committed a crime. So we could apply the "uniform posterior threshold" approach to fairness and use an incarceration rule that would incarcerate an individual exactly when their posterior probability of having committed a crime exceeded some uniform threshold. But note that because crime rates (and hence prior probabilities of crime) will generically differ between populations (because outside option distributions differ), setting the <i>same</i> threshold on posterior probability of crime for both groups corresponds to setting <i>different</i> thresholds on the raw noisy signals. This makes sense --- a Bayesian doesn't need as strong evidence to convince her that someone from a high crime group has committed a crime, as she would need to be convinced that someone from a low crime group has committed a crime, because she started off with a higher prior belief about the person from the high crime group. This (as we already know) results in a classification rule that has different false positive rates and false negative rates across groups. 
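To make the last two paragraphs concrete, here is a small numerical sketch. Everything in it is my own toy instantiation rather than the paper's parameterization: signals are assumed Gaussian (N(1,1) for the guilty, N(0,1) for the innocent), and the two groups differ only in base rate. A uniform threshold on raw signals equalizes false positive and false negative rates exactly but not positive predictive value, while a uniform threshold on posteriors translates into different signal thresholds per group:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rates_for_signal_threshold(base_rate, tau):
    """FPR, FNR, PPV when we incarcerate iff the raw signal exceeds tau.
    Signals (an assumption): N(1,1) if guilty, N(0,1) if innocent."""
    fpr = 1.0 - Phi(tau)          # innocent, but above the threshold
    fnr = Phi(tau - 1.0)          # guilty, but below the threshold
    tp = base_rate * (1.0 - fnr)
    fp = (1.0 - base_rate) * fpr
    return fpr, fnr, tp / (tp + fp)

def signal_threshold_for_posterior(base_rate, t):
    """The signal tau at which P(guilty | signal = tau) = t.  The
    posterior is monotone in the signal, so binary search works."""
    def posterior(s):
        num = base_rate * math.exp(-0.5 * (s - 1.0) ** 2)
        return num / (num + (1.0 - base_rate) * math.exp(-0.5 * s ** 2))
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if posterior(mid) < t else (lo, mid)
    return (lo + hi) / 2.0

b_high, b_low = 0.4, 0.1   # two groups differing only in base rate

# Uniform threshold on raw signals: FPR and FNR equalize, PPV does not.
fpr1, fnr1, ppv1 = rates_for_signal_threshold(b_high, 1.0)
fpr2, fnr2, ppv2 = rates_for_signal_threshold(b_low, 1.0)

# Uniform threshold (1/2) on posteriors: different signal thresholds.
tau_high = signal_threshold_for_posterior(b_high, 0.5)
tau_low = signal_threshold_for_posterior(b_low, 0.5)
```

With these made-up base rates (0.4 and 0.1), the implied signal thresholds come out around 0.9 and 2.7 --- the Bayesian demands far stronger evidence from the low-base-rate group, exactly as described above.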
</div><div><br /></div><div>On the other hand, if we want to equalize false positive and false negative rates across groups, we need an incarceration rule that sets the same threshold on raw noisy signals, independently of group. This will of course correspond to setting different thresholds on the posterior probability of crime (i.e. thresholding calibrated risk scores differently for different groups). And this will always be sub-optimal from the point of view of predicting crime --- the Bayes optimal predictor uniformly thresholds posterior probabilities. </div><div><br /></div><h2>Which Notions of Fairness Lead to Desirable Outcomes?</h2><div><br /></div><div>But only one of these solutions is consistent with our social goal of minimizing crime. And it's not the Bayes optimal predictor. The crime-minimizing solution is the one that sets <i>different</i> thresholds on posterior probabilities (i.e. uniform thresholds on signals) so as to equalize false positive rates and false negative rates. In other words, to minimize crime, society should explicitly commit to <i>not</i> conditioning on group membership, even when group membership is statistically informative for the goal of predicting crime. </div><div><br /></div><div>Why? It's because although using demographic information is statistically informative for the goal of predicting crime when base rates differ, it is not something that is under the control of individuals --- they can control their own choices, but not what group they were born into. And making decisions about individuals using information that is not under their control has the effect of distorting their dis-incentive to commit crime --- it ends up providing less of a dis-incentive to individuals from the higher crime group (since they are more likely to be wrongly incarcerated even if they don't commit a crime). And because in our model people are rational actors, minimizing crime is all about managing incentives. 
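The incentive logic is easy to see in a stripped-down version of the model. Everything below --- the Gaussian signals, the exponential outside-option distributions, the payoff constants --- is my own made-up instantiation, not the paper's. A rational type with outside option $w$ commits a crime exactly when the benefit of crime, net of the extra incarceration risk that crime brings, exceeds $w$; so a group's crime rate is its outside-option CDF at that cutoff, and the crime-minimizing threshold is the one maximizing the deterrence gap $\Pr[\textrm{jail}|\textrm{crime}] - \Pr[\textrm{jail}|\textrm{innocent}]$ --- the same signal threshold for every group, regardless of its distribution:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

BENEFIT, JAIL_COST = 1.0, 2.0   # made-up payoffs, identical for everyone

def crime_rate(tau, outside_option_cdf):
    """Fraction of a group that rationally chooses crime when members are
    jailed iff their noisy signal (assumed N(1,1) if guilty, N(0,1) if
    innocent) exceeds tau.  A type with outside option w commits a crime
    iff BENEFIT - JAIL_COST*P(jail|crime) > w - JAIL_COST*P(jail|innocent),
    i.e. iff w is below the cutoff computed here."""
    deterrence_gap = Phi(tau) - Phi(tau - 1.0)
    return outside_option_cdf(BENEFIT - JAIL_COST * deterrence_gap)

# Two groups differing only in their outside options (exponential
# distributions with different means -- again an arbitrary choice):
cdf_poor = lambda w: max(0.0, 1.0 - math.exp(-w / 0.5))
cdf_rich = lambda w: max(0.0, 1.0 - math.exp(-w / 2.0))

# Grid-search each group's crime-minimizing signal threshold.
grid = [i / 100.0 for i in range(-200, 300)]
best_poor = min(grid, key=lambda t: crime_rate(t, cdf_poor))
best_rich = min(grid, key=lambda t: crime_rate(t, cdf_rich))
# Both searches land on the same threshold (the one maximizing the
# deterrence gap), even though the groups' crime rates there differ.
```

In this toy version the crime-minimizing rule really is group-blind: the optimal threshold depends only on the signal distributions, not on the group's outside options, so it coincides across groups.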
</div><div><br /></div><div>This is our baseline model, and in <a href="https://www.cis.upenn.edu/~aaroth/Papers/endogenous.pdf">the paper</a> we introduce a number of extensions, generalizations, and elaborations on the model in order to stress-test it. The conclusions continue to hold in more elaborate and general settings, but at a high level, the key assumptions that are needed to reach them are that:</div><div><div><ol><li>The underlying base rates are rationally responsive to the decision rule used by society.</li><li>Signals are observed at the same rates across populations, and</li><li>The signals are conditionally independent of an individual’s group, conditioned on the individual’s decision about whether or not to commit crime.</li></ol></div><div>Here, conditions (2) and (3) are unlikely to hold precisely in most situations, but we show that they can be relaxed in various ways while still preserving the core conclusion.</div><div><br /></div><div>But more generally, if we are in a setting in which we believe that individual decisions are rationally made in response to the deployed classifier, and yet the deployed classifier does not equalize false positive and negative rates, then this is an indication that <i>either </i>the deployed classifier is sub-optimal (for the purpose of minimizing crime rates), or that one of conditions (2) and (3) fails to hold. Since in fairness relevant settings, the failure of conditions (2) and (3) is itself undesirable, this can be a diagnostic to highlight discriminatory conditions earlier in the pipeline than the final incarceration rule. 
In particular, if conditions (2) or (3) fail to hold, then imposing technical fairness constraints on a deployed classifier may be premature, and instead attention should be focused on structural differences in the observations that are being fed into the deployed classifier.</div></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-49676141994357544642019-10-30T16:33:00.000-04:002019-10-30T16:35:11.129-04:00FORC: A new conference you should know about.Here is the CFP: <a href="https://responsiblecomputing.org/forc-2020-call-for-paper/">https://responsiblecomputing.org/forc-2020-call-for-paper/</a><br /><br /><h1 class="headline" style="color: #504741; font-family: Montserrat, "Helvetica Neue", Arial, sans-serif; font-size: 2.7em; letter-spacing: -1.4px; line-height: 1.1; margin: 0px 0px 12px; padding: 0px; text-shadow: rgba(0, 0, 0, 0.1) 3px 3px 0px; text-transform: uppercase;">FORC 2020: CALL FOR PAPERS</h1><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px; text-align: center;"><b>Symposium on Foundations of Responsible Computing</b></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">The Symposium on Foundations of Responsible Computing (FORC) is a forum for mathematical research in computation and society writ large. The Symposium aims to catalyze the formation of a community supportive of the application of theoretical computer science, statistics, economics and other relevant analytical fields to problems of pressing and anticipated societal concern. 
</div><hr style="background-color: #dddddd; border: 0px; color: #dddddd; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; height: 1px; margin: 6px 0px 8px; padding: 0px; width: 1024px;" /><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;"><b>Important Dates</b></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">February 11: Submission Deadline<br />March 23: Notification to Authors<br />April 1: Camera Ready Deadline<br />June 1-3: The conference</div><hr style="background-color: #dddddd; border: 0px; color: #dddddd; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; height: 1px; margin: 6px 0px 8px; padding: 0px; width: 1024px;" /><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">Any mathematical work on computation and society is welcomed, including topics that are not yet well-established and topics that will arise in the future. This includes the investigation of definitions, algorithms and lower bounds, trade-offs, and economic incentives in a variety of areas. A small sample of topics follow: formal approaches to privacy, including differential privacy; fairness and discrimination in machine learning; bias in the formation of, and diffusion in, social networks; electoral processes and allocation of elected representatives (including redistricting). </div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">The inaugural FORC will be held on June 1-3 at the Harvard Center for Mathematical Sciences and Applications (CMSA), and will have its proceedings published by LIPIcs. The program committee will review submissions to ensure a high quality program based on novel, rigorous and significant scientific contributions. 
Authors of accepted papers will have the option of publishing a 10-page version of their paper in the proceedings, or publishing only a 1-page extended abstract, to facilitate the publication of their work in another venue. 1-page abstracts will appear on the website, but not in the proceedings. The symposium itself will feature a mixture of talks by authors of accepted papers and invited talks.</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">Authors should upload a PDF of the paper through Easychair: <a href="https://easychair.org/conferences/?conf=forc2020" style="color: #99cc33; text-decoration-line: none; transition: background 0.25s ease 0s, color 0.25s ease 0s;">https://easychair.org/conferences/?conf=forc2020</a>. The font size should be at least 11 point and the paper should be formatted in a single column. Beyond these, there are no formatting or length requirements, but reviewers will only be asked to read the first 10 pages of the paper. It is the authors’ responsibility that the main results of the paper and their significance be clearly stated within the first 10 pages. Submissions should include proofs of all central claims, and the committee will put a premium on writing that conveys clearly and in the simplest possible way what the paper is accomplishing. Authors are free to post their paper on arXiv, etc. 
Future details will appear on the conference website: <a href="https://responsiblecomputing.org/" style="color: #99cc33; text-decoration-line: none; transition: background 0.25s ease 0s, color 0.25s ease 0s;">https://responsiblecomputing.org/</a>.</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;"><b>Steering Committee</b></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">Avrim Blum<br />Cynthia Dwork<br />Sampath Kannan<br />Jon Kleinberg<br />Shafi Goldwasser<br />Kobbi Nissim<br />Toni Pitassi<br />Omer Reingold<br />Guy Rothblum<br />Salvatore Ruggieri<br />Salil Vadhan<br />Adrian Weller</div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;"><b>Program Committee</b></div><div style="color: #666666; font-family: Roboto, Arial, Helvetica, sans-serif; font-size: 16px; padding: 12px 0px;">Yiling Chen, Harvard<br />Rachel Cummings, Georgia Tech<br />Anupam Datta, Carnegie Mellon University<br />Moritz Hardt, UC Berkeley<br />Nicole Immorlica, Microsoft Research<br />Michael Kearns, University of Pennsylvania<br />Katrina Ligett, Hebrew University<br />Audra McMillan, Boston University and Northeastern<br />Aaron Roth, University of Pennsylvania (<span style="font-weight: 700;">Chair</span>)<br />Guy Rothblum, Weizmann Institute<br />Adam Smith, Boston University<br />Steven Wu, University of Minnesota<br />Jonathan Ullman, Northeastern<br />Jenn Wortman Vaughan, Microsoft Research<br />Suresh Venkatasubramanian, University of Utah<br />Nisheeth Vishnoi, Yale<br />James Zou, Stanford</div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-49801433362090930392019-09-10T06:40:00.001-04:002019-09-10T06:40:41.470-04:00A New Analysis of "Adaptive Data Analysis"<div style="text-align: center;"><i>This is 
a blog post about our new paper, which you can read here: </i><a href="https://arxiv.org/abs/1909.03577">https://arxiv.org/abs/1909.03577</a> </div><div style="text-align: center;"><br /></div>The most basic statistical estimation task is estimating the expected value of some predicate $q$ over a distribution $\mathcal{P}$: $\mathrm{E}_{x \sim \mathcal{P}}[q(x)]$, which I'll just write as $q(\mathcal{P})$. Think about estimating the mean of some feature in your data, or the error rate of a classifier that you have just trained. There's a really obvious way to come up with a good estimate if you've got a dataset $S \sim \mathcal{P}^n$ of $n$ points that were sampled i.i.d. from the distribution: just use the empirical mean $a = \frac{1}{n}\sum_{i=1}^n q(S_i)$! In fact, this is a great way to estimate the values of a really large number of predicates, so long as they were chosen <i>non-adaptively</i>: that is, so long as you came up with all of the predicates you wanted to estimate before you estimated any of the answers. This phenomenon is, for example, what classical generalization theorems in machine learning rely on: the empirical error of a set of classifiers in some model class will be a good estimate of their actual, out-of-sample error, so long as your dataset is at least as large as the logarithm of the size of your model class.<br /><div><br /></div><div>But this guarantee breaks down if the predicates that you want to estimate are chosen in sequence, adaptively. For example, suppose you are trying to fit a machine learning model to data. If you train your first model, estimate its error, and then as a result of the estimate tweak your model and estimate its error again, you are engaging in exactly this kind of adaptivity. If you repeat this many times (as you might when you are tuning hyper-parameters in your model) you could quickly get yourself into big trouble. Of course there is a simple way around this problem: just don't re-use your data. 
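The failure mode is easy to reproduce in code (a toy demonstration of the phenomenon, not an experiment from the paper): the dataset below is pure noise, so every statistical query has true value 1/2, yet a single query chosen adaptively --- as a function of the earlier empirical answers --- comes back looking much larger than 1/2:

```python
import random

random.seed(0)
n, k = 1000, 400

# Pure noise: n points with k independent uniform {0,1} features, so
# every predicate q_i(x) = x[i] has true expectation exactly 1/2.
data = [[random.randint(0, 1) for _ in range(k)] for _ in range(n)]

def empirical_mean(predicate):
    return sum(predicate(x) for x in data) / n

# Non-adaptive phase: k queries, all fine -- each answer is near 1/2.
answers = [empirical_mean(lambda x, i=i: x[i]) for i in range(k)]

# Adaptive phase: one more query, chosen as a function of those answers.
# It asks: what fraction of points agree with each feature's *empirical*
# majority on more than half of the coordinates?
majority = [1 if a >= 0.5 else 0 for a in answers]

def adaptive_query(x):
    agreement = sum(1 for i in range(k) if x[i] == majority[i])
    return 1 if agreement > k / 2 else 0

overfit_estimate = empirical_mean(adaptive_query)
# The true expectation of adaptive_query is about 1/2 (a majority of k
# fair coins), but the empirical answer is badly inflated, because the
# query was tailored to the same sample it is evaluated on.
```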
The most naive baseline that gives statistically valid answers is called "data splitting". If you want to test k models in sequence, just randomly partition your data into k equal sized parts, and test each model on a fresh part. The holdout method is just the special case of k = 2. But this "naive" method doesn't make efficient use of data: its data requirements grow <i>linearly</i> with the number of models you want to estimate. </div><div><br /></div><div>It turns out it's possible to do better by perturbing the empirical means with a little bit of noise before you use them: this is what we (Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth --- DFHPRR) showed back in 2014 in <a href="https://arxiv.org/abs/1411.2664">this paper</a>, which helped kick off a small subfield known as "adaptive data analysis". In a nutshell, we proved a "transfer theorem" that says the following: if your statistical estimator is simultaneously <i>differentially private</i> and <i>sample accurate</i> --- meaning that with high probability it provides estimates that are close to the empirical means, then it will also be accurate out of sample. When paired with a simple differentially private mechanism for answering queries --- just perturbing their answers with Gaussian noise --- this gave a significant asymptotic improvement in data efficiency over the naive baseline! You can see how well it does in this figure (click to enlarge it): </div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Eip3yKJl5mE/XXWfXoqt8KI/AAAAAAAATxY/FMSJHq4ClJgZXOhuACMC5ZMFFQ0W8cmOgCLcBGAs/s1600/DFHPRRonly.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1029" data-original-width="1600" height="205" src="https://1.bp.blogspot.com/-Eip3yKJl5mE/XXWfXoqt8KI/AAAAAAAATxY/FMSJHq4ClJgZXOhuACMC5ZMFFQ0W8cmOgCLcBGAs/s320/DFHPRRonly.png" width="320" /></a></div><div style="text-align: left;">Well... Hmm. 
In this figure, we are plotting how many adaptively chosen queries can be answered accurately as a function of dataset size. By "accurately" we have arbitrarily chosen to mean: answers that have confidence intervals of width 0.1 and uniform coverage probability 95%. On the x axis, we've plotted the dataset size n, ranging from 100,000 to about 12 million. And we've plotted two methods: the "naive" sample splitting baseline, and using the sophisticated Gaussian perturbation technique, as analyzed by the bound we proved in the "DFHPRR" paper. (Actually --- a numerically optimized variant of that bound!) You can see the problem. Even with a dataset size in the tens of millions, the sophisticated method does substantially worse than the naive method! You can extrapolate from the curve that the DFHPRR bound will <i>eventually</i> beat the naive bound, but it would require a truly enormous dataset. When I try extending the plot out that far my optimizer runs into numeric instability issues. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">There has been improvement since then. In particular, DFHPRR didn't even obtain the best bound asymptotically. It is folklore that differentially private estimates generalize in expectation: the trickier part is to show that they enjoy <i>high probability</i> generalization bounds. This is what we showed in a sub-optimal way in DFHPRR. In 2015, a <a href="https://arxiv.org/abs/1511.02513">beautiful paper</a> by Bassily, Nissim, Smith, Steinke, Stemmer, and Ullman (BNSSSU) introduced the amazing "monitor technique" to obtain the asymptotically optimal high probability bound. An upshot of the bound was that the Gaussian mechanism can be used to answer roughly $k = n^2$ queries --- a quadratic improvement over the naive sample splitting mechanism! 
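The Gaussian mechanism itself is trivial to implement --- all the content of these results is in how much noise suffices for how many queries. A sketch (the noise scale below is an arbitrary placeholder, not the calibrated value from any of these bounds):

```python
import random

random.seed(1)

class GaussianMechanism:
    """Answer statistical queries with the empirical mean plus Gaussian
    noise.  The sigma here is purely illustrative: the whole game in the
    bounds discussed above is calibrating sigma to the number of
    adaptively chosen queries the analyst needs answered."""

    def __init__(self, dataset, sigma):
        self.dataset, self.sigma = dataset, sigma

    def answer(self, predicate):
        emp = sum(predicate(x) for x in self.dataset) / len(self.dataset)
        return emp + random.gauss(0.0, self.sigma)

data = [random.random() for _ in range(10000)]
mech = GaussianMechanism(data, sigma=0.01)

# The analyst only ever sees noisy answers, e.g. for q(x) = 1[x < 1/4]:
noisy = mech.answer(lambda x: 1.0 if x < 0.25 else 0.0)
```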
You can read about this technique in the lecture notes of the <a href="https://adaptivedataanalysis.com/">adaptive data analysis class</a> Adam Smith and I taught a few years back. Let's see how it does (click to enlarge):</div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-C83SCyN8uzc/XXWjUcHmTHI/AAAAAAAATxk/GbtveNxoh18E7xOA2HQf67OGj9ujeCAuwCLcBGAs/s1600/DFHPRRandBNSSSU.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1030" data-original-width="1600" height="206" src="https://1.bp.blogspot.com/-C83SCyN8uzc/XXWjUcHmTHI/AAAAAAAATxk/GbtveNxoh18E7xOA2HQf67OGj9ujeCAuwCLcBGAs/s320/DFHPRRandBNSSSU.png" width="320" /></a></div><div style="text-align: left;">Substantially better! Now we're plotting n from 100,000 up to about only 1.7 million. At this scale, the DFHPRR bound appears to be constant (at essentially 0), whereas the BNSSSU bound clearly exhibits quadratic behavior. It even beats the baseline --- by a little bit, so long as your dataset has somewhat more than a million entries... I should add that what we're plotting here is again a numerically optimized variant of the BNSSSU bound, not the closed-form version from their paper. So maybe not yet a practical technique. The problem is that the monitor argument --- while brilliant --- seems unavoidably to lead to large constant overhead. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Which brings us to our new work (this is joint work with Christopher Jung, Katrina Ligett, Seth Neel, Saeed Sharifi-Malvajerdi, and Moshe Shenfeld). We give a brand new proof of the transfer theorem. It is elementary, and in particular, obtains high probability generalization bounds directly from high probability sample-accuracy bounds, avoiding the need for the monitor argument. 
I think the proof is the most interesting part --- it's simple and (I think) illuminating --- but an upshot is that we get <i>substantially</i> better bounds, even though the improvement is just in the constants (the existing BNSSSU bound is known to be asymptotically tight). Here's the plot with our bound included --- the x-axis is the same, but note the substantially scaled-up y axis (click to enlarge): </div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-oR4HDFR07bE/XXWlv1vVIDI/AAAAAAAATxw/Fqbm-weNbMoQ5WbOgeOY3f0p6N2QN9bMACLcBGAs/s1600/allthree.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1004" data-original-width="1600" height="200" src="https://1.bp.blogspot.com/-oR4HDFR07bE/XXWlv1vVIDI/AAAAAAAATxw/Fqbm-weNbMoQ5WbOgeOY3f0p6N2QN9bMACLcBGAs/s320/allthree.png" width="320" /></a></div><div style="text-align: center;"><br /></div><div style="text-align: left;"><br /><b><u>Proof Sketch</u></b><br />Ok: on to the proof. Here is the trick. In actuality, the dataset S is first sampled from $\mathcal{P}^n$, and then some data analyst interacts with a differentially private statistical estimator, resulting in some transcript $\pi$ of query answer pairs. But now imagine that <i>after the interaction is complete</i>, S is <i>resampled</i> from $Q_\pi = (\mathcal{P}^n)|\pi$, the posterior distribution on datasets conditioned on $\pi$. If you reflect on this for a moment, you'll notice that this resampling experiment doesn't change the joint distribution on dataset transcript pairs $(S,\pi)$ at all. So if the mechanism promised high probability sample accuracy bounds, it still promises them in this resampling experiment. 
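If the resampling claim feels slippery, it can be checked exactly in a miniature example --- here a one-bit "dataset" and randomized response standing in for the private estimator (my toy stand-in, chosen so that everything is exact rational arithmetic):

```python
from fractions import Fraction as F

p = F(1, 3)   # P(S = 1): the "dataset" is a single Bernoulli(p) bit
q = F(1, 4)   # the estimator is randomized response: flip the bit w.p. q

def pr_joint(s, t):
    """P(S = s, transcript = t) in the real interaction."""
    prior = p if s == 1 else 1 - p
    return prior * (q if s != t else 1 - q)

def pr_transcript(t):
    return sum(pr_joint(s, t) for s in (0, 1))

def pr_posterior(s, t):
    """P(S = s | transcript = t): the posterior Q_pi."""
    return pr_joint(s, t) / pr_transcript(t)

# Drawing the transcript, then resampling S from the posterior, induces
# exactly the original joint distribution on (dataset, transcript) pairs:
resampling_is_invisible = all(
    pr_transcript(t) * pr_posterior(s, t) == pr_joint(s, t)
    for s in (0, 1) for t in (0, 1))

# Randomized response is eps-DP with e^eps = (1 - q) / q, and indeed the
# posterior never exceeds the prior by more than that factor:
e_eps = (1 - q) / q
posterior_is_stable = all(
    pr_posterior(s, t) <= e_eps * (p if s == 1 else 1 - p)
    for s in (0, 1) for t in (0, 1))
```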
But let's think about what that means: the mechanism can <i>first commit</i> to some set of answers $a_i$, and promise that with high probability, <i>after</i> S is resampled from $Q_\pi$, $|a_i - \frac{1}{n}\sum_{j=1}^n q_i(S_j)|$ is small. But under the resampling experiment, it is quite likely that the empirical value of the query $\frac{1}{n}\sum_{j=1}^n q_i(S_j)$ will end up being close to its expectation over the posterior: $q_i(Q_{\pi}) = \mathrm{E}_{S \sim Q_{\pi}}[\frac{1}{n}\sum_{j=1}^n q_i(S_j)]$. So the only way that a mechanism can promise high probability sample accuracy is if it actually promises high probability posterior accuracy: i.e. with high probability, for every query $q_i$ that was asked and answered, we must have that $|a_i - q_i(Q_\pi)|$ is small.<br /><br />That part of the argument was generic --- it didn't use differential privacy at all! But it serves to focus attention on these posterior distributions $Q_\pi$ that our statistical estimator induces. And it turns out it's not hard to see that the expected value of queries on posteriors induced by differentially private mechanisms has to be close to their true answers. For $(\epsilon,0)$-differential privacy, it follows almost immediately from the definition. Here is the derivation. 
Pick your favorite query $q$ and your favorite transcript $\pi$, and write $S_j \sim S$ to denote a uniformly randomly selected element of a dataset $S$:</div><div style="text-align: left;">$$q (Q_\pi) = \sum_{x} q (x) \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x | \pi]= \sum_{x} q (x) \cdot \frac{\Pr [\pi | S_j = x ] \cdot \Pr_{S \sim \mathcal{P}^n, S_j \sim S} [S_j = x]}{\Pr[\pi]}$$<br />$$\leq \sum_{x} q (x) \cdot \frac{e^\epsilon \Pr [\pi] \cdot \Pr_{S_j \sim \mathcal{P}} [S_j = x]}{\Pr[\pi]} = e^\epsilon \cdot q (\mathcal{P})$$<br /><br />Here, the inequality follows from the definition of differential privacy, which controls the effect that fixing a single element of the dataset to any value $(S_j = x)$ can have on the probability of any transcript: it can increase it multiplicatively by a factor of at most $e^\epsilon$. </div><div style="text-align: left;"><br />And that's it: so we must have that (with probability 1!), $|q(Q_\pi) - q(\mathcal{P})| \leq e^\epsilon-1 \approx \epsilon$. The transfer theorem then follows from the triangle inequality. We get a high probability bound for free, with no need for any heavy machinery.<br /><br />The argument is just a little more delicate in the case of $(\epsilon,\delta)$-differential privacy, and can be extended beyond linear queries --- but I think this gives the basic idea. The details are in <a href="https://arxiv.org/abs/1909.03577">our new "JLNRSS" paper</a>. Incidentally, one nice thing about having many different proofs of the same theorem is that you can start to see some commonalities. 
One seems to be: it takes six authors to prove a transfer theorem!</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-8317394468334396862019-05-28T06:47:00.001-04:002019-05-28T06:47:35.193-04:00Individual Notions of Fairness You Can Use<div style="text-align: center;"><span style="font-size: x-large;"><u>Individual Notions of Fairness You Can Use</u></span></div><br />Our group at Penn has been thinking about when <i>individual </i>notions of fairness might be practically achievable for a while, and we have <a href="https://arxiv.org/abs/1905.10660">two</a> <a href="https://arxiv.org/abs/1905.10607">new</a> approaches.<br /><br /><span style="font-size: large;"><u>Background</u>:</span><br /><u>Statistical Fairness</u><br />I've written about this before, <a href="http://aaronsadventures.blogspot.com/2017/11/between-statistical-and-individual.html">here</a>. But briefly: there are two families of definitions in the fairness in machine learning literature. The first group of definitions, which I call <i>statistical</i> fairness notions, is far and away the most popular. If you want to come up with your own statistical fairness notion, you can follow this recipe:<br /><ol><li>Partition the world into a small number of "protected sub-groups". You will probably be thinking along the lines of race or gender or something similar when you do this.</li><li>Pick your favorite error/accuracy metric for a classifier. This might literally be classification error, or false positive or false negative rate, or positive predictive value, or something else. Lots of options here. </li><li>Ask that this metric be approximately equalized across your protected groups.</li><li>Finally, enjoy your new statistical fairness measure! 
Congratulations!</li></ol><div>These definitions are far and away the most popular in this literature, in large part (I think) because they are so immediately actionable. Because they are defined as conditions on a small number of expectations, you can easily check whether your classifier is "fair" according to these metrics, and (although there are some interesting computational challenges) go and try to learn classifiers subject to these constraints. </div><div><br /></div><div>Their major problem is related to the reason for their success: they are defined as conditions on a small number of expectations or <i>averages</i> over people, and so they don't promise much to particular individuals. I'll borrow an example from our <a href="https://arxiv.org/abs/1711.05144">fairness gerrymandering</a> paper from a few years ago to put this in sharp relief. Imagine that we are building a system to decide who to incarcerate, and we want to be "fair" with respect to both gender (men and women) and race (green and blue people). We decide that in our scenario, it is the false positives who are harmed (innocent people sent to jail), and so to be fair, we decide we should equalize the false positive rate: across men and women, and across greens and blues. But one way to do this is to jail all green men and blue women. This does indeed equalize the false positive rate (at 50%) across all four of the groups we specified, but is cold comfort if you happen to be a green man --- since then you will be jailed with certainty. The problem was that our fairness constraint was never a promise to an individual to begin with, just a promise about the average behavior of our classifier over a large group. And although this is a toy example constructed to make a point, things like this happen in real data too. </div><br /><u>Individual Fairness</u><br />Individual notions of fairness, on the other hand, really do correspond to promises made to individuals. 
There are at least two kinds of individual fairness definitions that have been proposed: <a href="https://dl.acm.org/citation.cfm?id=2090255">metric fairness</a>, and <a href="http://papers.nips.cc/paper/6355-fairness-in-learning-classic-and-contextual-bandits">weakly meritocratic fairness</a>. Metric fairness proposes that the learner will be handed a <i>task-specific similarity metric</i>, and requires that individuals who are close together in the metric should have a similar probability of being classified as positive. Weakly meritocratic fairness, on the other hand, takes the (unknown) labels of an individual as a measure of merit, and requires that individuals who have a higher probability of really having a positive label should have only a higher probability of being classified as positive. This in particular implies that false positive and false negative rates should be equalized <i>across individuals</i>, where now the word <i>rate</i> is averaging over only the randomness of the classifier, not over people. What makes both of these notions of fairness <i>individual</i> is that they impose constraints that bind on all pairs of individuals and not just over averages of people.<br /><br />Definitions like this have the advantage of strong individual-level semantics, which the statistical definitions don't have. But they also have big problems: for metric fairness, the obvious question is: <i>where does the metric come from</i>? Even granting that fairness should be some Lipschitz condition on a metric, it seems hard to pin down what the metric is, and different people will disagree: coming up with the metric seems to encapsulate a large part of the original problem of defining fairness. For weakly meritocratic fairness, the obvious problem is that we don't know what the labels are. 
It's possible to do non-trivial things if you make assumptions about the label generating process, but it's not at all clear you can do any non-trivial learning subject to this constraint if you don't make strong assumptions.<br /><br /><span style="font-size: large;"><u>Two New Approaches:</u></span><br />We have two new approaches, building off of metric fairness and weakly meritocratic fairness respectively. Both have the advantages of statistical notions of fairness in that they can be put into practice without making unrealistic assumptions about the data, and without needing to wait on someone to hand us a metric. But they continue to make meaningful promises to individuals.<br /><br /><u>Subjective Individual Fairness</u><br />Let's start with our variant of metric fairness, which we call subjective individual fairness. (This is joint work with Michael Kearns, our PhD students Chris Jung and Seth Neel, our former PhD student Steven Wu, and Steven's student (our grand student!) Logan Stapleton). The paper is here: <a href="https://arxiv.org/abs/1905.10660">https://arxiv.org/abs/1905.10660</a>. We stick with the premise that "similar people should be treated similarly", and that whether or not it is correct/just/etc., it is at least fair to treat two people the same way, in the sense that we classify them as positive with the same probability. But we don't want to assume anything else.<br /><br />Suppose I were to create a machine learning fairness panel: I could recruit "AI Ethics" experts, moral philosophers, hyped-up consultants, people off the street, toddlers, etc. I would expect that there would be as many different conceptions of fairness as there were people on the panel, and that none of them could precisely quantify what they meant by fairness --- certainly not in the form of a "fairness metric". 
But I could still ask these people, in particular cases, if they thought it was fair that two particular individuals be treated differently or not.<br /><br />Of course, I would have no reason to expect that the responses that I got from the different panelists would be consistent with one another --- or possibly even internally consistent (we won't assume, e.g. that the responses satisfy any kind of triangle inequality). Nevertheless, once we fix a data distribution and a group of people who have opinions about fairness, we have a well defined tradeoff we can hope to manage: any classifier we could choose will have both:<br /><ol><li>Some error rate, and</li><li>Some frequency with which it makes a pair of decisions that someone in the group finds unfair. </li></ol><div>We can hope to find classifiers that optimally trade off 1 and 2: note this is a coherent tradeoff even though we haven't forced the people to try and express their conceptions of fairness into some consistent metric. What we show is that you can do this. </div><div><br /></div><div>Specifically, given a set of pairs that we have determined should be treated similarly, there is an <i>oracle efficient </i>algorithm that can find the optimal classifier subject to the constraint that no pair of individuals that has been specified as a constraint should have a substantially different probability of positive classification. Oracle efficiency means that what we can do is reduce the "fair learning" problem to a regular old learning problem, without fairness constraints. If we can solve the regular learning problem, we can also solve the fair learning problem. This kind of fairness constraint also generalizes in the standard way: if you ask your fairness panel about a reasonably small number of pairs, and then solve the in-sample problem subject to these constraints, the classifier you learn will also satisfy the fairness constraints out of sample. 
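To make the error/unfairness tradeoff concrete, here is a minimal sketch in code. To be clear, this is my own toy illustration and not the oracle-efficient algorithm from the paper: the function name, the soft-penalty formulation, and all the constants are invented for exposition. It fits a simple logistic model and penalizes the model whenever an elicited pair of individuals receives substantially different probabilities of positive classification.

```python
import numpy as np

def fit_with_elicited_pairs(X, y, pairs, lam=10.0, lr=0.05, steps=4000):
    """Toy stand-in for fair learning with elicited pairwise constraints:
    minimize average logistic loss plus lam * (p_i - p_j)^2 for every pair
    (i, j) that the panel said should be treated similarly."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # probability of positive classification
        grad = X.T @ (p - y) / len(y)      # gradient of the average logistic loss
        for i, j in pairs:                 # gradient of the pairwise penalty term
            diff = p[i] - p[j]
            grad += 2 * lam * diff * (p[i] * (1 - p[i]) * X[i]
                                      - p[j] * (1 - p[j]) * X[j])
        w -= lr * grad
    return w

# Tiny made-up example: unconstrained, individuals 0 and 1 end up with very
# different probabilities; constraining the pair (0, 1) pulls them together.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w_free = fit_with_elicited_pairs(X, y, pairs=[])
w_fair = fit_with_elicited_pairs(X, y, pairs=[(0, 1)])
probs = lambda w: 1.0 / (1.0 + np.exp(-X @ w))
gap_free = abs(probs(w_free)[0] - probs(w_free)[1])
gap_fair = abs(probs(w_fair)[0] - probs(w_fair)[1])
```

The real algorithm treats the elicited pairs as hard constraints (up to slack) and comes with the generalization guarantees described above; the penalty weight here just plays the role of the knob that trades error against elicited unfairness.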
And it works: we implement the algorithm and try it out on the COMPAS data set, with fairness constraints that we elicited from 43 human (undergrad) subjects. The interesting thing is that once you have an algorithm like this, it isn't only a tool to create "fair" machine learning models: it's also a new instrument to investigate human conceptions of fairness. We already see quite a bit of variation among our 43 subjects in our preliminary experiments. We plan to pursue this direction more going forward.</div><div><br /></div><div><u>Average Individual Fairness</u></div><div>Next, our variant of weakly meritocratic fairness. This is joint work with Michael Kearns and our student Saeed Sharifi. The paper is here: <a href="https://arxiv.org/abs/1905.10607">https://arxiv.org/abs/1905.10607</a>. In certain scenarios, it really does seem tempting to think about fairness in terms of false positive rates. Criminal justice is a great example, in the sense that it is clear that everyone agrees on which outcome they <i>want</i> (they would like to be released from jail), and so the people we are being unfair to really do seem to be the false positives: the people who should have been released from jail, but who were mistakenly incarcerated for longer. So in our "fairness gerrymandering" example above, maybe the problem with thinking about false positive rates wasn't a problem with <i>false positives</i>, but with <i>rates</i>: i.e. the problem was that the word rate averaged over many people, and so it didn't promise <i>you</i> anything. Our idea is to redefine the word rate. </div><div><br /></div><div>In some (but certainly not all) settings, people are subject to not just one, but many classification tasks. For example, consider online advertising: you might be shown thousands of targeted ads each month. Or applying for schools (a process that is centralized in cities like New York): you apply not just to one school, but to many. 
In situations like this, we can model the fact that we have not just a distribution over people, but also a distribution over (or collection of) problems. </div><div><br /></div><div>Once we have a distribution over problems, we can define the error rate, or false positive rate, or any other rate you like <i>for individuals. </i>It is now sensible to talk about Alice's false positive rate, or Bob's error rate, because rate has been redefined as an average over problems, for a particular individual. So we can now ask for individual fairness notions in the spirit of the statistical notions of fairness we discussed above! We no longer need to define protected groups: we can now ask that the false positive rates, or error rates, be equalized across all pairs of people. </div><div><br /></div><div>It turns out that given a reasonably sized sample of people, and a reasonably sized sample of problems, it is tractable to find the optimal classifier subject to constraints like this in sample, and that these guarantees generalize out of sample. The in-sample algorithm is again an oracle-efficient algorithm, or in other words, a reduction to standard, unconstrained learning. The generalization guarantee here is a little interesting, because now we are talking about simultaneous generalization in two different directions: to people we haven't seen before, and also to problems we haven't seen before. This requires thinking a little bit about what kind of object we are even trying to output: a mapping from new problems to classifiers. The details are in the paper (spoiler --- the mapping is defined by the optimal dual variables for the empirical risk minimization problem): here, I'll just point out that again, the algorithm is practical to implement, and we perform some simple experiments with it. 
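Concretely, once there is a collection of problems, a "rate" becomes a per-person average over problems rather than a per-problem average over people. A minimal sketch with made-up data (not the paper's experiments --- the 10% error probability and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_problems = 5, 1000

# True labels for each (person, problem) pair, and the predictions of a
# hypothetical classifier that errs on roughly 10% of its decisions.
labels = rng.integers(0, 2, size=(n_people, n_problems))
errs = rng.random((n_people, n_problems)) < 0.1
preds = np.where(errs, 1 - labels, labels)

# Averaging over problems (axis=1) gives each individual their own rates:
# "Alice's error rate" is now a well-defined quantity.
error_rate = (preds != labels).mean(axis=1)
negatives = (labels == 0).sum(axis=1)
fpr = ((preds == 1) & (labels == 0)).sum(axis=1) / negatives
```

The fairness constraint is then that these per-person rates be approximately equalized across all pairs of people --- no protected groups need to be specified in advance.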
</div><div><br /></div><div><br /></div><br /><br /><br />Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-7640412054712349752019-04-09T14:18:00.001-04:002019-04-09T14:18:22.630-04:00The Ethical Algorithm<div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-nzPVP9O6SOw/XKzg_FfbvsI/AAAAAAAARnY/V8Ky09-gas8kroK1MTjFkf4vJiwoXefhwCLcBGAs/s1600/Kearns_EthicalAlgorithm_comp1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1425" data-original-width="938" height="320" src="https://2.bp.blogspot.com/-nzPVP9O6SOw/XKzg_FfbvsI/AAAAAAAARnY/V8Ky09-gas8kroK1MTjFkf4vJiwoXefhwCLcBGAs/s320/Kearns_EthicalAlgorithm_comp1.jpg" width="211" /></a></div>I've had the good fortune to be able to work on a number of research topics so far, including privacy, fairness, algorithmic game theory, and adaptive data analysis, and the relationship between all of these things and machine learning. As academics, we do a lot of writing about the things we work on, but usually our audience is narrow and technical: other researchers in our sub-specialty. But it can be both fun and important to communicate to a wider audience as well. So my amazing colleague Michael Kearns and I wrote a book, called <a href="https://www.amazon.com/Ethical-Algorithm-Science-Socially-Design/dp/0190948205">The Ethical Algorithm</a>. It's coming out in October (Amazon says the release date is November 1st, but my understanding is that pre-orders will start shipping on October 4).<br /><br />This was the first time for either of us writing something like this: it's not a textbook, it's a "trade book" -- a popular science book. Its intended readership isn't just computer science PhDs, but the educated public broadly. But there should be plenty in it to interest experts too, because we cover quite a bit of ground. 
The book is about the problems that arise when algorithmic decision making interacts with human beings --- and the emerging science about how to fix them. The topics we cover include privacy, fairness, strategic interactions and gaming, and the scientific reproducibility crisis.<br /><br />Lots has been written about the problems that can arise, especially related to privacy and fairness. And we're not trying to reinvent the wheel. Instead, the focus of our book is the exciting and embryonic algorithmic science that has grown to address these issues.<br /><br />The privacy chapter develops differential privacy, and its strengths and weaknesses. The fairness chapter covers recent work on algorithmic fairness that has come out of the computer science literature. The gaming chapter studies how algorithm design can affect the equilibria that emerge from large scale interactions. The reproducibility chapter explores the underlying issues that lead to false discovery, and recent algorithmic advances that hold promise in avoiding them. We try and set expectations appropriately. We don't pretend that the solutions to complex societal problems can be entirely (or even primarily) algorithmic. But we argue that embedding social values into algorithms will inevitably form an important component of any solution.<br /><br />I'm really excited for it to come out. 
If you want, you can pre-order it now, either at Amazon: <a href="https://www.amazon.com/Ethical-Algorithm-Science-Socially-Design/dp/0190948205">https://www.amazon.com/Ethical-Algorithm-Science-Socially-Design/dp/0190948205</a> or directly from the publisher: <a href="https://global.oup.com/academic/product/the-ethical-algorithm-9780190948207?cc=us&lang=en&#.XKegC9eWi2w.twitter">https://global.oup.com/academic/product/the-ethical-algorithm-9780190948207?cc=us&lang=en&#.XKegC9eWi2w.twitter</a>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-22516451658743787502019-02-26T15:41:00.000-05:002019-02-26T16:01:16.391-05:00Impossibility Results in Fairness as Bayesian Inference<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>One of the most striking results about fairness in machine learning is the impossibility result that <a href="https://www.liebertpub.com/doi/full/10.1089/big.2016.0047">Alexandra Chouldechova</a>, and separately<a href="https://arxiv.org/abs/1609.05807"> Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan</a> discovered a few years ago. These papers say something very crisp. I'll focus here on the binary classification setting that Alex studies because it is much simpler. There are (at least) three reasonable properties you would want your "fair" classifiers to have. They are:<br /><div><ol><li><b>False Positive Rate Balance</b>: The rate at which your classifier makes errors in the positive direction (i.e. labels negative examples positive) should be the same across groups.</li><li><b>False Negative Rate Balance</b>: The rate at which your classifier makes errors in the negative direction (i.e. 
labels positive examples negative) should be the same across groups.</li><li><b>Predictive Parity</b>: The statistical "meaning" of a positive classification should be the same across groups (we'll be more specific about what this means in a moment).</li></ol><div>What Chouldechova and KMR show is that if you want all three, you are out of luck --- unless you are in one of two very unlikely situations: Either you have a <i>perfect</i> classifier that never errs, or the <i>base rate</i> is exactly the same for both populations --- i.e. both populations have exactly the same frequency of positive examples. If you don't find yourself in one of these two unusual situations, then you have to give up on properties 1, 2, or 3. </div></div><div><br /></div><div>This is discouraging, because there are good reasons to want each of properties 1, 2, and 3. And these aren't measures made up in order to formulate an impossibility result --- they have their root in the <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">ProPublica/COMPAS controversy</a>. Roughly speaking, ProPublica discovered that the COMPAS recidivism prediction algorithm violated false positive and negative rate balance, and they took the position that this made the classifier <i>unfair</i>. Northpointe (the creators of the COMPAS algorithm) responded by saying that their algorithm satisfied predictive parity, and took the position that this made the classifier fair. They were seemingly talking past each other by using two different definitions of what "fair" should mean. What the impossibility result says is that there is no way to satisfy both sides of this debate. </div><div><br /></div><div>So why is this result true? The proof in Alex's paper can't be made simpler --- it's already a one-liner, following from an algebraic identity. But the first time I read it I didn't have a great intuition for <i>why</i> it held. 
Viewing the statement through the lens of Bayesian inference made the result very intuitive (at least for me). With this viewpoint, all the impossibility result is saying is: "If you have different <i>priors</i> about some event (say that a released inmate will go on to commit a crime) for two different populations, and you receive evidence of the same strength for both populations, then you will have different posteriors as well". This is now bordering on obvious --- because your posterior belief about an event is a combination of your prior belief and the new evidence you have received, weighted by the strength of that evidence. </div><div><br /></div><div>Let's walk through this. Suppose we have two populations, call them $A$s and $B$s. Individuals $x$ from these populations have some true binary label $\ell(x) \in \{0,1\}$ which we are trying to predict. Individuals from the two populations are drawn from different distributions, which we'll call $D_A$ and $D_B$. We have some classifier that predicts labels $\hat\ell(x)$, and we would like it to satisfy the three fairness criteria defined above. First, let's define some terms:<br /><br />The <i>base rate</i> for a population $i$ is just the frequency of positive labels:<br />$$p_i = \Pr_{x \sim D_i}[\ell(x) = 1].$$<br />The <i>false positive </i>and <i>false negative </i>rates of the classifier are:<br />$$FPR_i = \Pr_{x \sim D_i}[\hat\ell(x) = 1 | \ell(x) = 0] \ \ \ \ FNR_i = \Pr_{x \sim D_i}[\hat\ell(x) = 0 | \ell(x) = 1].$$<br />And the <i>positive predictive value</i> of the classifier is:<br />$$PPV_i = \Pr_{x \sim D_i}[\ell(x) = 1 | \hat\ell(x)=1].$$</div><div>Satisfying all three fairness constraints just means finding a classifier such that $FPR_A = FPR_B$, $FNR_A = FNR_B$, and $PPV_A = PPV_B$.<br /><br />How should we prove that this is impossible? 
All three of these quantities are conditional probabilities, so we are essentially obligated to apply Bayes Rule:<br />$$PPV_i = \Pr_{x \sim D_i}[\ell(x) = 1 | \hat\ell(x)=1] = \frac{ \Pr_{x \sim D_i}[\hat\ell(x)=1 | \ell(x) = 1]\cdot \Pr_{x \sim D_i} [\ell(x) = 1]}{ \Pr_{x \sim D_i}[\hat \ell(x) = 1]}$$<br />But now these quantities on the right hand side are things we have names for. Substituting in, we get:<br />$$PPV_i = \frac{p_i(1-FNR_i)}{p_i(1-FNR_i) + (1-p_i)FPR_i}$$<br /><br />And so now we see the problem. Suppose we have $FNR_A = FNR_B$ and $FPR_A = FPR_B$. Can we have $PPV_A = PPV_B$? There are only two ways. If $p_A = p_B$, then we are done, because the right hand side is the same for either $i \in \{A,B\}$. But if the base rates are different, then the only way to make these two quantities equal is if $FNR_i = FPR_i = 0$ --- i.e. if our classifier is perfect.<br /><br />The piece of intuition here is that the base rate is our prior belief that $\ell(x) = 1$, before we see the output of the classifier. The positive predictive value is our <i>posterior</i> belief that $\ell(x) = 1$, after we see the output of the classifier. And all we need to know about the classifier in order to apply Bayes rule to derive our posterior from our prior is its false positive rate and its false negative rate --- these fully characterize the "strength of the evidence." Hence: "If our prior probabilities differ, and we see evidence of a positive label of the same strength, then our posterior probabilities will differ as well."<br /><br />Once you realize this, then you can generalize the fairness impossibility result to other settings by making equally obvious statements about probability elsewhere. :-)<br /><br />For example, suppose we generalize the labels to be real valued instead of binary --- so when making decisions, we can model individuals using shades of gray. (e.g. 
in college admissions, we don't have to model individuals as "qualified" or not, but rather can model talent as a real value.) Let's fix a model for concreteness, but the particulars are not important. (The model here is related to my paper with <a href="https://www.cis.upenn.edu/~kannan/">Sampath Kannan</a> and <a href="http://www.its.caltech.edu/~jziani/">Juba Ziani</a> on <a href="https://arxiv.org/abs/1808.09004">the downstream effects of affirmative action</a>)<br /><br />Suppose that in population $i$, labels are distributed according to a Gaussian distribution with mean $\mu_i$: $\ell(x) \sim N(\mu_i, 1)$. For an individual from group $i$, we have a test that gives an unbiased estimator of their label, with some standard deviation $\sigma_i$: $\hat \ell(x) \sim N(\ell(x), \sigma_i)$.<br /><br />In a model like this, we have analogues of our fairness desiderata in the binary case:<br /><br /><ul><li><b>Analogue of Error Rate Balance</b><i style="font-weight: bold;">: </i>We would like our test to be equally informative about both populations: $\sigma_A = \sigma_B$. </li><li><b>Analogue of Predictive Parity</b>: Any test score $t$ should induce the same posterior expectation on true labels across populations: $$E_{D_A}[\ell(x) | \hat \ell(x) = t] = E_{D_B}[\ell(x) | \hat \ell(x) = t]$$ </li></ul><div>Can we satisfy both of these conditions at the same time? Because the normal distribution is <i>self-conjugate</i> (that's why we chose it!) Bayes Rule simplifies to have a nice closed form, and we can compute our posteriors as follows:</div><div>$$E_{D_i}[\ell(x) | \hat \ell(x) = t] = \frac{\sigma_i^2}{\sigma_i^2 + 1}\cdot \mu_i + \frac{1}{\sigma_i^2 + 1}\cdot t$$</div><div>So there are only two ways we can achieve both properties:</div><div><ol><li>We can of course satisfy both conditions if the prior distributions are the same for both groups: $\mu_A = \mu_B$. 
Then we can set $\sigma_A = \sigma_B$ and observe that the right hand side of the above expression is identical for $i \in \{A, B\}$.</li><li>We can also satisfy both conditions if the prior means are different, but the signal is perfect: i.e. $\sigma_A = \sigma_B = 0$. (Then both posterior means are just $t$, independent of the prior means). </li></ol></div>But we can see by inspection that these are the only two cases. If $\sigma_A = \sigma_B$, but the prior means are different, then the posterior means will be different for every $t$. This is really the same impossibility result as in the binary case: all it is saying is that if I have different priors about different groups, but the evidence I receive has the same strength, then my posteriors will also be different.<br /><br />So the mathematical fact is simple --- but its implications remain deep. It means we have to choose between equalizing a decision maker's posterior about the label of an individual, or providing an equally accurate signal about each individual, and that we cannot have both. Unfortunately, living without either one of these conditions can lead to real harm.<br /><br /></div><div><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-25591568418001877152019-01-26T17:19:00.001-05:002019-01-28T08:54:32.268-05:00Algorithmic Unfairness Without Any Bias Baked InDiscussion of (un)fairness in machine learning hit mainstream political discourse this week, when Representative Alexandria Ocasio-Cortez discussed the possibility of algorithmic bias, and was clumsily "called out" by Ryan Saavedra on Twitter: <br /><center><blockquote class="twitter-tweet" data-lang="en"><div dir="ltr" lang="en">Socialist Rep. 
Alexandria Ocasio-Cortez (D-NY) claims that algorithms, which are driven by math, are racist <a href="https://t.co/X2veVvAU1H">pic.twitter.com/X2veVvAU1H</a>— Ryan Saavedra (@RealSaavedra) <a href="https://twitter.com/RealSaavedra/status/1087627739861897216?ref_src=twsrc%5Etfw">January 22, 2019</a><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script></div></blockquote></center>It was gratifying to see the number of responses pointing out how wrong he was --- awareness of algorithmic bias has clearly become pervasive! But most of the pushback focused on the possibility of bias being "baked in" by the designer of the algorithm, or because of latent bias embedded in the data, or both: <br /><center><blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><div dir="ltr" lang="en">You know algorithms are written by people right? And that the data they are trained on is made and selected by people? And that the problems algorithms solve are decided...again...by people? And that people can be and many times are racist? Ok now you do the math</div>— kade (@onekade) <a href="https://twitter.com/onekade/status/1087853353000939521?ref_src=twsrc%5Etfw">January 22, 2019</a></blockquote><script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script></center>Bias in the data is certainly a problem, especially when labels are gathered by human beings. But it's far from being the only problem. In this post, I want to walk through a very simple example in which the algorithm designer is being entirely reasonable, there are no human beings injecting bias into the labels, and yet the resulting outcome is "unfair". Here is the (toy) scenario -- the specifics aren't important. High school students are applying to college, and each student has some innate "talent" $I$, which we will imagine is normally distributed, with mean 100 and standard deviation 15: $I \sim N(100,15)$. 
The college would like to admit students who are sufficiently talented --- say one standard deviation above the mean (so, it would like to admit students with $I \geq 115$). The problem is that talent isn't directly observable. Instead, the college can observe <i>grades</i> $g$ and <i>SAT scores $s$</i>, which are a noisy estimate of talent. For simplicity, let's imagine that both grades and SAT scores are independently and normally distributed, centered at a student's talent level, and also with standard deviation 15: $g \sim N(I, 15)$, $s \sim N(I, 15)$.<br /><br />In this scenario, the college has a simple, optimal decision rule: It should run a linear regression to try to predict student talent from grades and SAT scores, and then it should admit the students whose <i>predicted </i>talent is at least 115. This is indeed "driven by math" --- since we assumed everything was normally distributed here, this turns out to correspond to the Bayesian optimal decision rule for the college.<br /><br />OK. Now let's suppose there are two populations of students, which we will call Reds and Blues. Reds are the majority population, and Blues are a small minority population --- the Blues only make up about 1% of the student body. But the Reds and the Blues are no different when it comes to talent: they both have the same talent distribution, as described above. And there is no bias baked into the grading or the exams: both the Reds and the Blues also have exactly the same grade and exam score distributions, as described above.<br /><br />But there is one difference: the Blues have a bit more money than the Reds, so they each take the SAT twice, and report only the highest of the two scores to the college. This results in a small but noticeable bump in their average SAT scores, compared to the Reds. 
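This setup is easy to simulate. Here is a rough sketch of such a simulation (my own code, not the code behind the post's figures; I use a 10% minority share rather than 1% so that a quick run has enough qualified Blues to measure rates on, and the exact percentages will vary with the seed and sample sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_population(n, retakes_sat):
    talent = rng.normal(100, 15, n)
    grades = rng.normal(talent, 15)
    sat = rng.normal(talent, 15)
    if retakes_sat:  # Blues report the max of two independent attempts
        sat = np.maximum(sat, rng.normal(talent, 15))
    return np.column_stack([np.ones(n), grades, sat]), talent

def false_negative_rate(X, talent, w):
    qualified = talent >= 115      # truly talented students
    rejected = X @ w < 115         # predicted talent below the bar
    return (qualified & rejected).sum() / qualified.sum()

X_red, t_red = sample_population(100_000, retakes_sat=False)
X_blue, t_blue = sample_population(10_000, retakes_sat=True)

# Separate regressions, one per population.
w_red = np.linalg.lstsq(X_red, t_red, rcond=None)[0]
w_blue = np.linalg.lstsq(X_blue, t_blue, rcond=None)[0]
gap_separate = abs(false_negative_rate(X_red, t_red, w_red)
                   - false_negative_rate(X_blue, t_blue, w_blue))

# One group-blind regression on the pooled student body.
X_all = np.vstack([X_red, X_blue])
t_all = np.concatenate([t_red, t_blue])
w_all = np.linalg.lstsq(X_all, t_all, rcond=None)[0]
gap_pooled = abs(false_negative_rate(X_red, t_red, w_all)
                 - false_negative_rate(X_blue, t_blue, w_all))
```

With separate models, each regression corrects for its own population's score distribution, so the two false negative rates come out close. The pooled, group-blind model is fit almost entirely to the much larger Red population and stops correcting for the Blues' inflated SAT scores, so the gap between the populations' false negative rates widens --- in the Blues' favor.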
Here are the grades and exam scores for the two populations, plotted:<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-B7-EMI0LIZ8/XEzQO2OBAsI/AAAAAAAAQ3o/PFO_wd6igLgwpA5OCQ1Ux0N7B9A5s3mcACLcBGAs/s1600/gradesexams.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="960" data-original-width="1464" height="261" src="https://1.bp.blogspot.com/-B7-EMI0LIZ8/XEzQO2OBAsI/AAAAAAAAQ3o/PFO_wd6igLgwpA5OCQ1Ux0N7B9A5s3mcACLcBGAs/s400/gradesexams.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"></div>So what is the effect of this when we use our reasonable inference procedure? First, let's consider what happens when we learn two different regression models: one for the Blues, and a different one for the Reds. We don't see much difference:<br /><br />The Red classifier makes errors approximately 11% of the time. The Blue classifier does about the same --- it makes errors about 10.4% of the time. This makes sense: the Blues artificially inflated their SAT score distribution without increasing their talent, and the classifier picked up on this and corrected for it. In fact, it is even a little more accurate!<br /><br />And since we are interested in fairness, let's think about the <i>false negative rate</i> of our classifiers. "False Negatives" in this setting are the people who are qualified to attend the college ($I \geq 115$), but whom the college mistakenly rejects. These are really the people who have come to harm as a result of the classifier's mistakes. And the False Negative <i>Rate</i> is the probability that a randomly selected qualified person is mistakenly rejected from college --- i.e. the probability that a randomly selected student is harmed by the classifier. 
We should want that the false negative rates are approximately equal across the two populations: this would mean that the burden of harm caused by the classifier's mistakes is not disproportionately borne by one population over the other. This is one reason why the difference between false negative rates across different populations has become a standard fairness metric in algorithmic fairness --- <a href="http://papers.nips.cc/paper/6373-equality-of-opportunity-in-supervised-learning">sometimes referred to as "equal opportunity."</a><br /><br />So how do we fare on this metric? Not so badly! The Blue model has a false negative rate of 50% on the Blues, and the Red model has a false negative rate of 47% on the Reds --- so the difference between these two is a satisfyingly small 3%.<br /><br />But you might reasonably object: because we have learned separate models for the Blues and the Reds, we are <i>explicitly </i>making admissions decisions as a function of a student's color! This might sound like a form of discrimination, baked in by the algorithm designer --- and if the two populations represent e.g. racial groups, then it's explicitly illegal in a number of settings, including lending.<br /><br />So what happens if we don't allow our classifier to see group membership, and just train one classifier on the whole student body? The gap in false negative rates between the two populations balloons to 12.5%, and the overall error rate ticks up. This means if you are a qualified member of the Red population, you are substantially more likely to be mistakenly rejected by our classifier than if you are a qualified member of the Blue population.<br /><br />What happened? There wasn't any malice anywhere in this data pipeline. 
It's just that the Red population was much larger than the Blue population, so when we trained a classifier to minimize its average error over the entire student body, it naturally fit the Red population --- which contributed much more to the <i>average</i>. But this means that the classifier was no longer compensating for the artificially inflated SAT scores of the Blues, and so was making a disproportionate number of errors on them --- all in their favor.<br /><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-W3rfiRIJUC0/XEzciybyBfI/AAAAAAAAQ4A/lmABxaYed28K77up20emGgc3TMr8D05QQCLcBGAs/s1600/classifier.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="960" data-original-width="1460" height="262" src="https://4.bp.blogspot.com/-W3rfiRIJUC0/XEzciybyBfI/AAAAAAAAQ4A/lmABxaYed28K77up20emGgc3TMr8D05QQCLcBGAs/s400/classifier.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The combined admissions rule takes everyone above the black line. Since the Blues are shifted up relative to the Reds, they are admitted at a disproportionately higher rate. </td></tr></tbody></table><div class="separator" style="clear: both; text-align: center;"></div><br /><br />This is the kind of thing that happens <i>all the time</i>: whenever there are two populations that have different feature distributions, learning a single classifier (that is prohibited from discriminating based on population) will fit the bigger of the two populations, simply because they contribute more to average error. Depending on the nature of the distribution difference, this can be either to the benefit or the detriment of the minority population. 
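The two experiments just described are easy to reproduce in miniature. The post's exact data-generating process isn't restated here, so the parameters below (talent drawn from N(100, 15), a +20 point SAT inflation for the Blues, and a 9:1 Red-to-Blue population ratio) are hypothetical stand-ins of mine; with them, a simple least-squares sketch shows the same qualitative effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_group(n, inflation):
    talent = rng.normal(100, 15, n)                  # unobserved qualification
    sat = talent + rng.normal(0, 10, n) + inflation  # noisy, possibly inflated score
    return sat, talent

def false_negative_rate(admit, qualified):
    return np.mean(~admit[qualified])  # qualified applicants who were rejected

def predict(sat_train, talent_train, sat_test):
    slope, intercept = np.polyfit(sat_train, talent_train, 1)  # least-squares fit
    return slope * sat_test + intercept

sat_r, tal_r = make_group(18000, 0)   # Reds: the larger population
sat_b, tal_b = make_group(2000, 20)   # Blues: artificially inflated SAT scores
qual_r, qual_b = tal_r > 115, tal_b > 115

# Separate models, one per group:
gap_separate = abs(false_negative_rate(predict(sat_r, tal_r, sat_r) > 115, qual_r)
                   - false_negative_rate(predict(sat_b, tal_b, sat_b) > 115, qual_b))

# A single group-blind model trained on everyone:
sat_all, tal_all = np.concatenate([sat_r, sat_b]), np.concatenate([tal_r, tal_b])
gap_pooled = abs(false_negative_rate(predict(sat_all, tal_all, sat_r) > 115, qual_r)
                 - false_negative_rate(predict(sat_all, tal_all, sat_b) > 115, qual_b))

print(f"FNR gap, separate models: {gap_separate:.3f}")
print(f"FNR gap, pooled model:    {gap_pooled:.3f}")
```

With these (made-up) parameters, the pooled model's false negative rate gap comes out several times larger than the separate models' gap, and the pooled model's mistakes favor the Blues --- qualitatively mirroring the 3% vs. 12.5% numbers reported above.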
And not only does this not involve any explicit human bias, either on the part of the algorithm designer or the data gathering process, <i>it is exacerbated if we artificially force the algorithm to be group blind</i>. Well-intentioned "fairness" regulations prohibiting decision makers from taking sensitive attributes into account can actually make things less fair and less accurate at the same time.<br /><br /><br /><br /><br />Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-23660775307529850912019-01-10T15:26:00.000-05:002019-01-10T15:27:20.128-05:002019 SIGecom Dissertation Award: Call for Nominations<div class="MsoNormal" style="margin: 0px;">Dear all,<br /><br />Please consider nominating graduating Ph.D. students for the SIGecom Dissertation Award. If you are a graduating student, consider asking your adviser or other senior mentor to nominate you.<br /><br />Nominations are due on February 28, 2019. This award is given to a student who defended a thesis in 2018. It is a prestigious award and is accompanied by a $1500 prize. In the past, the grand prize has been awarded to:<br /><br />2017: Aviad Rubinstein, "Hardness of Approximation Between P and NP"<br />2016: Peng Shi, "Prediction and Optimization in School Choice"<br />2015: Inbal Talgam-Cohen, "Robust Market Design: Information and Computation"<br />2014: S. Matthew Weinberg, "Algorithms for Strategic Agents"<br />2013: Balasubramanian Sivan, "Prior Robust Optimization"<br /><br /><br />And the award has had seven runners-up: Rachel Cummings, Christos Tzamos, Bo Waggoner, James Wright, Xi (Alice) Gao, Yang Cai, and Sigal Oren. You can find detailed information about the nomination process at: <a href="http://www.sigecom.org/awardd.html">http://www.sigecom.org/awardd.html</a>. 
We look forward to reading your nominations!<br /><br /><br />Your Award Committee,<br /><b><br /></b><b>Renato Paes Leme</b><br /><b>Aaron Roth</b> (Chair)<br /><b>Inbal Talgam-Cohen</b></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-79617626242519666022018-03-12T09:58:00.000-04:002018-03-12T09:58:12.150-04:00Call for nominations for the SIGecom Dissertation Award<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 12.8px;"><div><div><div><div>Dear all,</div><div><br /></div><div>Please consider nominating recently graduated Ph.D. students working in algorithmic game theory/mechanism design/market design for the SIGecom Dissertation Award. If you are a graduating student, consider asking your adviser or other senior mentor to nominate you.<br /><br />Nominations are due at the end of this month, March 31, 2018. This award is given to a student who defended a thesis in 2017. It is a prestigious award and is accompanied by a $1500 prize. In the past, the grand prize has been awarded to:<br /><br />2016: Peng Shi, "Prediction and Optimization in School Choice"<br />2015: Inbal Talgam-Cohen, "Robust Market Design: Information and Computation"<br />2014: S. 
Matthew Weinberg, "Algorithms for Strategic Agents"<br />2013: Balasubramanian Sivan, "Prior Robust Optimization"<br /></div>and the award has had five runners-up: Bo Waggoner, James Wright, Xi (Alice) Gao, Yang Cai, and Sigal Oren. You can find detailed information about the nomination process at: <a href="http://www.sigecom.org/awardd.html" style="color: #1155cc;" target="_blank">http://www.sigecom.org/awardd.html</a>. We look forward to reading your nominations!<br /></div>Your Award Committee,<br /></div>Nicole Immorlica</div>Ariel Procaccia</div><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 12.8px;">Aaron Roth</span><br /><div><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 12.8px;"><br /></span></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-85755745253144963522018-03-04T21:14:00.000-05:002018-03-06T10:02:27.683-05:00How (un)likely is an "intelligence explosion"?<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script> <br /><div class="separator" style="clear: both; text-align: center;"></div>I've been having fun recently reading about "AI Risk". There is lots of eloquent writing out there about this topic: I especially recommend <a href="http://slatestarcodex.com/superintelligence-faq/">Scott Alexander's Superintelligence FAQ</a> for those looking for a fun read. 
The subject has reached the public consciousness, with high-profile people like <a href="http://www.newsweek.com/stephen-hawking-artificial-intelligence-warning-destroy-civilization-703630">Stephen Hawking</a> and <a href="https://www.npr.org/sections/thetwo-way/2017/07/17/537686649/elon-musk-warns-governors-artificial-intelligence-poses-existential-risk">Elon Musk</a> speaking publicly about it. There is also an increasing amount of funding and research effort being devoted to understanding AI risk. See for example the <a href="https://www.fhi.ox.ac.uk/">Future of Humanity Institute</a> at Oxford, the <a href="https://futureoflife.org/">Future of Life Institute</a> at MIT, and the <a href="https://intelligence.org/">Machine Intelligence Research Institute</a> in Berkeley, among others. These groups seem to be doing lots of interesting research, which I am mostly ignorant of. In this post I just want to talk about a simple exercise in asymptotics.<br /><br /><div style="text-align: center;"><b>First, Some Background.</b></div><br />A "superintelligent" AI is loosely defined to be an entity that is much better than we are at essentially any cognitive/learning/planning task. Perhaps, by analogy, a superintelligent AI is to human beings as human beings are to Bengal tigers, in terms of general intelligence. It shouldn't be hard to convince yourself that if we were in the company of a superintelligence, then we would be very right to be worried: after all, it is intelligence that allows human beings to totally dominate the world and drive Bengal tigers to near extinction, despite the fact that tigers physiologically dominate humans in most other respects. This is the case even if the superintelligence doesn't have the destruction of humanity as a goal per se (after all, we don't have it out for tigers), and even if the superintelligence is just an unconscious but super-powerful optimization algorithm. 
I won't rehash the arguments here (<a href="http://slatestarcodex.com/superintelligence-faq/">Scott does it better</a>) but it essentially boils down to the fact that it is quite hard to anticipate what the results of optimizing an objective function will be, if the optimization is done over a sufficiently rich space of strategies. And if we get it wrong, and the optimization has some severely unpleasant side-effects? It is tempting to suggest that at that point, we just unplug the computer and start over. The problem is that if we unplug the intelligence, it won't do as well at optimizing its objective function as it would if it took steps to prevent us from unplugging it. So if its strategy space is rich enough that it is able to take steps to defend itself, it will. Lots of the most interesting research in this field seems to be about how to align optimization objectives with our own desires, or simply how to write down objective functions that don't induce the optimization algorithm to try and prevent us from unplugging it, while also not incentivizing the algorithm to unplug itself (the <a href="https://intelligence.org/files/Corrigibility.pdf">corrigibility </a>problem).<br /><br />Ok. It seems uncontroversial that a hypothetical superintelligence would be something we should take very seriously as a danger. But isn't it premature to worry about this, given how far off it seems to be? We aren't even that good at <a href="http://funnytweeter.com/wired-machine-learning-will-take-over-the-worldamazon-we-see-you-bought-a-wallet-would-you-like-to-buy-another-wallet/">making product recommendations</a>, let alone optimization algorithms so powerful that they might inadvertently destroy all of humanity. Even if superintelligence will ultimately be something to take very seriously, are we even in a position to productively think about it now, given how little we know about how such a thing might work at a technical level? 
This seems to be the position that Andrew Ng was taking, in his much quoted statement that (paraphrasing) worrying about the dangers of super-intelligence right now is like worrying about overpopulation on Mars. Not that it might not eventually be a serious concern, but that we will get a higher return investing our intellectual efforts right now on more immediate problems.<br /><br />The standard counter to this is that super-intelligence might always seem like it is well beyond our current capabilities -- maybe centuries in the future -- until, all of a sudden, it appears as the result of an uncontrollable chain reaction known as an "intelligence explosion", or "singularity". (As far as I can tell, very few people actually think that intelligence growth would exhibit an actual mathematical singularity --- this seems instead to be a metaphor for exponential growth.) If this is what we expect, then now might very well be the time to worry about super-intelligence. The first argument of this form was put forth by British mathematician I.J. Good (of <a href="https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation">Good-Turing Frequency Estimation</a>!):<br /><blockquote class="tr_bq">“Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.”</blockquote>Scott Alexander summarizes the same argument a bit more quantitatively. 
In this passage, he is imagining the starting point being a full-brain simulation of Einstein --- except run on faster hardware, so that our simulated Einstein operates at a much faster clock-speed than his historical namesake:<br /><blockquote class="tr_bq">It might, like the historical Einstein, contemplate physics. Or it might contemplate an area very relevant to its own interests: artificial intelligence. In that case, instead of making a revolutionary physics breakthrough every few hours, it will make a revolutionary AI breakthrough every few hours. Each AI breakthrough it makes, it will have the opportunity to reprogram itself to take advantage of its discovery, becoming more intelligent, thus speeding up its breakthroughs further. The cycle will stop only when it reaches some physical limit – some technical challenge to further improvements that even an entity far smarter than Einstein cannot discover a way around. </blockquote><blockquote class="tr_bq">To human programmers, such a cycle would look like a “critical mass”. Before the critical level, any AI advance delivers only modest benefits. But any tiny improvement that pushes an AI above the critical level would result in a feedback loop of inexorable self-improvement all the way up to some stratospheric limit of possible computing power. </blockquote><blockquote class="tr_bq">This feedback loop would be exponential; relatively slow in the beginning, but blindingly fast as it approaches an asymptote. Consider the AI which starts off making forty breakthroughs per year – one every nine days. Now suppose it gains on average a 10% speed improvement with each breakthrough. It starts on January 1. Its first breakthrough comes January 10 or so. Its second comes a little faster, January 18. Its third is a little faster still, January 25. By the beginning of February, it’s sped up to producing one breakthrough every seven days, more or less. 
By the beginning of March, it’s making about one breakthrough every three days or so. But by March 20, it’s up to one breakthrough a day. By late on the night of March 29, it’s making a breakthrough every second.</blockquote>As far as I can tell, this possibility of an exponentially-paced intelligence explosion is the main argument for folks devoting time to worrying about super-intelligent AI <i>now</i>, even though current technology doesn't give us anything even close. So in the rest of this post, I want to push a little bit on the claim that the feedback loop induced by a self-improving AI would lead to exponential growth, and see what assumptions underlie it.<br /><br /><div style="text-align: center;"><b>A Toy Model for Rates of Self Improvement</b></div><br />Let's write down an extremely simple toy model for how quickly the intelligence of a self-improving system would grow, as a function of time. And I want to emphasize that the model I will propose is clearly a toy: it abstracts away everything that is interesting about the problem of designing an AI. But it should be sufficient to focus on a simple question of asymptotics, and the degree to which growth rates depend on the extent to which AI research exhibits diminishing marginal returns on investment. In the model, AI research accumulates with time: at time $t$, $R(t)$ units of AI research have been conducted. Perhaps think of this as a quantification of the number of AI "breakthroughs" that have been made in Scott Alexander's telling of the intelligence explosion argument. The intelligence of the system at time $t$, denoted $I(t)$, will be some function of the accumulated research $R(t)$. The model will make two assumptions:<br /><br /><ol><li>The rate at which research is conducted is directly proportional to the current intelligence of the system. We can think about this either as a discrete dynamics, or as a differential equation. 
In the discrete case, we have: $R(t+1) = R(t) + I(t)$, and in the continuous case: $\frac{dR}{dt} = I(t)$. </li><li>The relationship between the current intelligence of the system and the currently accumulated quantity of research is governed by some function f: $I(t) = f(R(t))$.</li></ol><div>The function f can be thought of as capturing the <i>marginal rate of return </i>of additional research on the actual intelligence of an AI. For example, if we think AI research is something like pumping water from a well --- a task for which doubling the work doubles the return --- then we would model f as linear: $f(x) = x$. In this case, AI research does not exhibit any diminishing marginal returns: a unit of research gives us just as much benefit in terms of increased intelligence, no matter how much we already understand about intelligence. On the other hand, if we think that AI research should exhibit diminishing marginal returns --- as many creative endeavors seem to --- then we would model f as an increasing concave function. For example, we might let $f(x) = \sqrt{x}$, or $f(x) = x^{2/3}$, or $f(x) = x^{1/3}$, etc. If we are really pessimistic about the difficulty of AI, we might even model $f(x) = \log(x)$. In these cases, intelligence is still increasing in research effort, but the rate of increase as a function of research effort is diminishing, as we understand more and more about AI. Note however that the rate at which research is being conducted is increasing, which might still lead us to exponential growth in intelligence, if it increases fast enough.</div><div><br /></div><div>So how does our choice of f affect intelligence growth rates? First, let's consider the case in which $f(x) = x$ --- the case of no diminishing marginal returns on research investment. 
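The two modeling assumptions above are easy to simulate directly. Here is a minimal sketch of the discrete dynamics (the starting value $R(0) = 2$ is an arbitrary choice of mine, made so that the $f(x) = \log(x)$ case doesn't start at zero intelligence):

```python
import math

def intelligence_growth(f, T):
    """Simulate the discrete dynamics R(t+1) = R(t) + I(t), with I(t) = f(R(t))."""
    R = 2.0  # starting research level; 2.0 (not 1.0) so that f = log gets off the ground
    trajectory = [f(R)]
    for _ in range(T):
        R += trajectory[-1]      # assumption 1: research accrues at the current intelligence
        trajectory.append(f(R))  # assumption 2: intelligence is f of accumulated research
    return trajectory

for name, f in [("x", lambda x: x),
                ("x^(2/3)", lambda x: x ** (2 / 3)),
                ("sqrt(x)", math.sqrt),
                ("x^(1/3)", lambda x: x ** (1 / 3)),
                ("log(x)", math.log)]:
    print(f"f(x) = {name:8s} I(1000) = {intelligence_growth(f, 1000)[-1]:.3g}")
```

Running it shows the separation discussed below: with $f(x) = x$ the final intelligence is an astronomically large power of two, while every concave choice of f stays comparatively tame.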
Here is a plot of the growth over 1000 time steps in the discrete model: </div><div class="separator" style="clear: both; text-align: center;"><a href="https://2.bp.blogspot.com/-NbHTMjVG8X4/WpwfWK9686I/AAAAAAAAMFk/wvch5-xLSukezDHSXs6FADECuPDiah_qgCEwYBhgL/s1600/linear.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="273" data-original-width="380" height="229" src="https://2.bp.blogspot.com/-NbHTMjVG8X4/WpwfWK9686I/AAAAAAAAMFk/wvch5-xLSukezDHSXs6FADECuPDiah_qgCEwYBhgL/s320/linear.png" width="320" /></a></div><div>Here, we see exponential growth in intelligence. (It isn't hard to directly work out that in this case, in the discrete model, we have $I(t) = 2^t$, and in the continuous model, we have $I(t) = e^t$). And the plot illustrates the argument for worrying about AI risk <i>now</i>. Viewed at this scale, progress in AI appears to plod along at unimpressive levels before suddenly shooting up to an unimaginable level: in this case, a quantity that, if written down as a decimal, would have more than 300 zeros. </div><div><br /></div><div>It isn't surprising that if we were to model severely diminishing returns --- say $f(x) = \log(x)$ --- this would not occur. Below, we plot what happens when $f(x) = \log(x)$, with time taken out all the way to 1,000,000 rather than merely 1000 as in the above plot:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-r5_3JqvijEA/Wpw7oVIubQI/AAAAAAAAMFw/dvOhuSSeROYVjH3uZWKrhCrcTENgr0xvwCLcBGAs/s1600/log.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="389" height="218" src="https://3.bp.blogspot.com/-r5_3JqvijEA/Wpw7oVIubQI/AAAAAAAAMFw/dvOhuSSeROYVjH3uZWKrhCrcTENgr0xvwCLcBGAs/s320/log.png" width="320" /></a></div><div>Intelligence growth is not very impressive here. At time 1,000,000 we haven't even reached 17. 
If you wanted to reach (say) an intelligence level of 30 you'd have to wait an unimaginably long time. In this case, we definitely don't need to worry about an "intelligence explosion", and probably not even about <i>ever </i>reaching anything that could be called a super-intelligence. </div><div><br /></div><div>But what about moderate (polynomial) levels of diminishing marginal returns? What if we take $f(x) = x^{1/3}$? Let's see:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-u2OdxHbY7m8/Wpw8Z4YOUoI/AAAAAAAAMF4/6NYvU_sG7WMkGzqcvnq-MNU2x2K1MA8SQCLcBGAs/s1600/one%2Bthird.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="386" height="220" src="https://1.bp.blogspot.com/-u2OdxHbY7m8/Wpw8Z4YOUoI/AAAAAAAAMF4/6NYvU_sG7WMkGzqcvnq-MNU2x2K1MA8SQCLcBGAs/s320/one%2Bthird.png" width="320" /></a></div><div>Ok --- now we are making more progress, but even though intelligence now has a polynomial relationship to research (and research speed is increasing, in a chain reaction!) the rate of growth in intelligence is still <i>decreasing</i>. What about if $f(x) = \sqrt{x}$? Let's see:</div><div class="separator" style="clear: both; text-align: center;"><a href="https://3.bp.blogspot.com/-zjkn7z9-nkE/Wpw87l13nWI/AAAAAAAAMGE/RDV1b77CAV87fUK5geqNTi8CeVuqzkYUwCLcBGAs/s1600/sqrt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="392" height="217" src="https://3.bp.blogspot.com/-zjkn7z9-nkE/Wpw87l13nWI/AAAAAAAAMGE/RDV1b77CAV87fUK5geqNTi8CeVuqzkYUwCLcBGAs/s320/sqrt.png" width="320" /></a></div><div>At least now the rate of growth doesn't seem to be decreasing: but it is growing only linearly with time. Hardly an explosion. Maybe we just need to get more aggressive in our modeling. What if $f(x) = x^{2/3}$? 
</div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-z6tk89sEBzo/WpxUJPiASrI/AAAAAAAAMGU/J62ANBiWncgIdZyRWWgvSWcS4H5YEbeZACLcBGAs/s1600/twothirds.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="266" data-original-width="411" height="207" src="https://1.bp.blogspot.com/-z6tk89sEBzo/WpxUJPiASrI/AAAAAAAAMGU/J62ANBiWncgIdZyRWWgvSWcS4H5YEbeZACLcBGAs/s320/twothirds.png" width="320" /></a></div><div>Ok, now we've got something! At least now the rate of intelligence gains is <i>increasing</i> with time. But it is increasing more slowly than a quadratic function --- a far cry from the exponential growth that characterizes an intelligence explosion. </div><div><br /></div><div>Let's take a break from all of this plotting. The model we wrote down is simple enough that we can just go and solve the differential equation. Suppose we have $f(x) = x^{1-\epsilon}$ for some $\epsilon > 0$. Then the differential equation solves to give us:</div><div>\[I(t) = \left(\left(1+\epsilon t \right)^{1/\epsilon} \right)^{1-\epsilon} \]</div><div>What this means is that for <i>any</i> positive value of $\epsilon$, in this model, intelligence grows at only a polynomial rate. The only way this model gives us exponential growth is if we take $\epsilon \rightarrow 0$, and insist that $f(x) = x$ --- i.e. that the intelligence design problem does not exhibit any diminishing marginal returns at all. </div><div><br /></div><div style="text-align: center;"><b>Thoughts</b></div><div style="text-align: left;">So what do we learn from this exercise? Of course one can quibble with the details of the model, and one can believe different things about what form for the function f best approximates reality. 
But for me, this model helps crystallize the extent to which the "exponential intelligence explosion" story crucially relies on intelligence design being one of those rare tasks that doesn't exhibit any decreasing marginal returns on effort at all. This seems unlikely to me, and <a href="https://simplystatistics.org/2014/03/20/the-8020-rule-of-statistical-methods-development/">counter to experience</a>. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Of course, there <i>are</i> technological processes out there that do appear to exhibit exponential growth, at least for a little while. <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore's law</a> is the most salient example. But it is important to remember that even exponential growth for a little while need not <i>seem</i> explosive at human time scales. Doubling every day corresponds to exponential growth, but so does increasing by 1% a year. To <a href="https://freedom-to-tinker.com/2018/01/04/singularity-skepticism-2-why-self-improvement-isnt-enough/">paraphrase Ed Felten</a>: our retirement plans extend beyond depositing a few dollars into a savings account, and waiting for the inevitable "wealth explosion" that will make us unimaginably rich. </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><div style="text-align: center;"><b>Postscript</b></div><div style="text-align: left;">I don't claim that anything in this post is either novel or surprising to folks who spend their time thinking about this sort of thing. There is at least <a href="https://arxiv.org/pdf/1702.08495.pdf">one paper </a>that writes down a model including diminishing marginal returns, which yields a linear rate of intelligence growth.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">It is also interesting to note that in the model we wrote down, exponential growth is really a knife edge phenomenon. 
We already observed that we get exponential growth if $f(x) = x$, but not if $f(x) = x^{1-\epsilon}$ for any $\epsilon > 0$. But what if we have $f(x) = x^{1+\epsilon}$ for $\epsilon > 0$? In that case, we don't get exponential growth either! As Hadi Elzayn pointed out to me, <a href="https://arxiv.org/pdf/1012.1843.pdf">Osgood's Test</a> tells us that in this case, the function $I(t)$ contains an actual mathematical singularity --- it approaches an infinite value in finite time. </div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com1tag:blogger.com,1999:blog-25562705.post-60148550894013651322018-01-10T20:32:00.002-05:002018-01-10T20:32:54.201-05:00Fairness and The Problem with Exploration: A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem<b><u>Bandit Problems</u></b><br /><b><u><br /></u></b>"Bandit problems" are a common abstraction in machine learning. The name is supposed to evoke the image of slot machines, which are also known as "One-armed bandits" (or so I am told... Somehow nobody speaks like this in the circles I run in.) In the classic formulation, you imagine yourself standing in front of a bank of slot machines, each of which is different and might have a different distribution on payoffs. You can sequentially decide which machine's arm to pull: when you pull the arm, you get a sample from that machine's reward distribution. Your goal is to pull the arms in such a way that your average reward approaches what it would have been had you played the optimal strategy in hindsight: i.e. always pulled the arm of the machine that had the highest mean reward. The problem is tricky because you don't know the reward distributions to start. 
The key feature of the problem is that in order to observe a sample from the reward distribution of a particular machine, you actually have to pull that arm and spend a turn to do it.<br /><div><br /></div><div>There are more complicated variants of this kind of problem, in which your choices may vary from round to round, but have differentiating features. For example, in the <i>linear contextual </i>bandit problem, in every round, you are presented with a number of choices, each of which is represented by some real valued vector, which may be different from round to round. You then get to choose one of them. The reward you get, corresponding to the action you choose, is random --- but its expectation is equal to some unknown linear function of the action's features. The key property of the problem is that you only get to observe the reward of the action you choose. You do not get to observe the reward you <i>would</i> have obtained had you chosen one of the other actions --- this is what makes it a <i>bandit</i> problem. The goal again is to choose actions such that your average reward approaches what it would have been had you played the optimal strategy in hindsight. In this case, with full knowledge of the underlying linear functions, the optimal strategy would simply compute the expected reward of each action, and play the action with highest expected reward (this may now be a different action at each round, since the features change). Again, the problem is tricky because you don't know the linear functions which map features to expected reward. </div><div><br /></div><div>Bandit problems are characterized by the tension between <i>exploration</i> and <i>exploitation. </i>At any given moment, the algorithm has some guess as to what the best action is. 
(In a linear contextual bandit problem, for example, the algorithm might just run least-squares regression on the examples it has observed rewards for: the point predictions of the regression estimate on the new actions are these best guesses). In order to be able to compete with the optimal policy, the algorithm definitely needs to (eventually) play the action it thinks is the best one, a lot of the time. This is called exploitation: the algorithm is exploiting its knowledge in order to do well. However, since the algorithm does not observe the rewards for actions it does not play, it also will generally need to <i>explore</i>. Consider someone in front of those slot machines: if he has played each machine once, he has some weak beliefs about the payoff distribution for each machine. But those beliefs have resulted from just a single sample, so they may well be very wrong. If he just forever continues to play the machine that has the highest empirical reward, he might play the same machine every day, and he will never learn more about the other machines --- even though one of them might have turned out to really have higher average reward. He has fooled himself, and because he never <i>explores</i>, he continues to be wrong forever. So optimal algorithms have to carefully balance exploration and exploitation in order to do well. </div><div><br /></div><div>There are various clever schemes for trading off exploration and exploitation, but the simplest one is called epsilon-Greedy. At every round, the algorithm flips a coin with bias epsilon. If the coin comes up heads (with probability 1-epsilon), the algorithm <i>exploits</i>: it just greedily chooses the action that it estimates to be best. If the coin comes up tails (with probability epsilon), the algorithm <i>explores: </i>it chooses an action uniformly at random. 
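Here is a minimal sketch of the epsilon-Greedy rule just described, for the slot-machine version of the problem (the Bernoulli arm means are hypothetical numbers of mine). Setting epsilon to zero recovers the purely greedy player, who can lock onto a suboptimal machine forever after a few unlucky pulls:

```python
import random

def epsilon_greedy(true_means, T, epsilon, seed=0):
    """Epsilon-Greedy for the classic k-armed bandit with Bernoulli rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls, totals = [0] * k, [0.0] * k
    reward = 0.0
    for t in range(T):
        if rng.random() < epsilon or t < k:  # explore (after playing each arm once)
            i = t % k if t < k else rng.randrange(k)
        else:                                # exploit: best empirical mean so far
            i = max(range(k), key=lambda j: totals[j] / pulls[j])
        r = 1.0 if rng.random() < true_means[i] else 0.0  # Bernoulli reward
        pulls[i] += 1
        totals[i] += r
        reward += r
    return reward / T

means = [0.3, 0.5, 0.7]  # hypothetical arm means; the optimal policy earns 0.7 on average
print("greedy (eps=0):    avg reward %.3f" % epsilon_greedy(means, 100000, 0.0))
print("eps-greedy (0.05): avg reward %.3f" % epsilon_greedy(means, 100000, 0.05))
```

With a small constant epsilon, the average reward approaches the best arm's mean (up to the reward sacrificed on exploration rounds); the pure greedy player's reward depends on the luck of its first few pulls.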
If you set epsilon appropriately, in many settings, you have the guarantee that the average reward of the algorithm will approach the average reward of the optimal policy, as time goes on. Fancier algorithms for bandit problems more smoothly interpolate between exploration and exploitation: but all of them inevitably have to manage this tradeoff. </div><div><br /><b><u>Fairness Concerns</u></b><br /><br /></div><div>(Contextual) bandit problems are not just an interesting mathematical curiosity, but actually arise all the time in important applications. Here are just a few:</div><div><ul><li>Lending: Banks (and increasingly algorithms) consider applicants for loans. They can look at applicant features, and observe the repayment history of applicants they grant loans to, but they cannot observe counterfactuals: would an applicant not given a loan have defaulted if he had been given the loan? </li><li>Parole Decisions: Judges (and increasingly algorithms) consider inmates for parole. In part, they want to predict which inmates will go on to re-offend if released, and <i>not</i> release those inmates. But it is not possible to observe the counterfactual: would an inmate who was not released have gone on to commit a crime if he had been released? </li><li>Predictive Policing: Police chiefs (and increasingly algorithms) consider where in their city they want to deploy their police. In part, they want to predict where crimes will occur, so they can deploy a heavier police presence in those areas. But they also disproportionately <i>observe</i> crimes in the areas in which police are deployed.</li><li>Sequential Clinical Trials: Certain kinds of drugs affect patients differently, depending on their genetic markers. But for a new drug, the relationship might be unknown. 
As patients show up, they can be assigned to different clinical trials, corresponding to different drugs --- but we cannot observe the counterfactual: how would a patient have reacted to a drug she was not given? </li></ul><div>I picked these four examples (rather than the more commonly used example of ad targeting) because in each of them, the decision made by the algorithm has an important effect on someone's life. This raises issues of <i>fairness</i>, and as we will see, these fairness questions relate to exploration in at least two different ways. </div><div><br /></div><div>First, there is an issue that was recently raised by <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2846909">Bird, Barocas, Crawford, Diaz, and Wallach</a>: that exploring can be <i>unfair</i> to the individuals who have the misfortune of being present during the exploration rounds. Consider the example of sequential clinical trials: if a patient shows up on an "exploitation" round, then she is given the treatment that is believed to be the best for her, using a best effort estimate given the information available. But if she shows up during an "exploration" round, she will be given a random treatment --- generally, one that current evidence suggests is not the best for her. Is this fair to her? What if her life depends on the treatment? Exploration is explicitly sacrificing her well-being, for the possible good of future individuals, who might be able to benefit from what we learned from the exploration. It may be that when we are dealing with important decisions about people's lives, we don't want to sacrifice the welfare of an individual for the good of others. Certainly, our notional patient would prefer not to show up on an exploration round. In various other settings, we can also imagine exploration being repugnant, even though it is in principle necessary for learning. 
Can we (for example) randomly release inmates on parole, even if we believe them to be high risks for committing more violent crimes? </div></div><div><br /></div><div>There is another, conflicting concern: if we do not explore, we might not correctly learn about the decisions we have to make. And there are a number of reasons we might expect to see insufficient exploration. Exploration is necessary to maximize long-term reward, but decision makers might be myopic. For example, the loan officers actually making lending decisions might be more interested in maximizing their annual bonuses than maximizing the long-term profits of the bank. Even the CEO might be more interested in quarterly share prices. But myopic decision makers won't explore (tangent: We had <a href="https://arxiv.org/abs/1705.02321">a paper at EC</a> thinking about how one might incentivize myopic decision makers to nevertheless be meritocratically fair). The algorithms in use in many settings might also not have been designed properly --- if our learning algorithms don't explicitly take into account the bandit nature of the problem, and just treat it as a supervised classification problem, then they won't explore --- and this kind of mistake is probably extremely common. And finally, as we argued above, we might believe that it is simply not "fair" to explore, and so we intentionally avoid it. But a lack of exploration (and the resulting failure to properly learn) can itself be a source of unfairness. The "feedback loops" that result from a lack of exploration have been blamed by <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2016.00960.x/full">Lum and Isaac</a> and by <a href="https://arxiv.org/pdf/1706.09847.pdf">Ensign, Friedler, Neville, Scheidegger, and Venkatasubramanian</a> as a primary source of unfairness in predictive policing. 
(See <a href="https://algorithmicfairness.wordpress.com/2018/01/05/models-need-doubt-the-problematic-modeling-behind-predictive-policing/">Suresh's recent blog post</a>). Such algorithms can overestimate crime in the poor neighborhoods in which police are deployed, and underestimate crime in the rich neighborhoods. If they don't explore to correct these mis-estimates, they will deploy more police to the poor neighborhoods, and fewer to the rich neighborhoods, which only exacerbates the data collection problem. </div><div><br /><b><u>Our New Paper</u></b><br /><br /></div><div>These two concerns both lead us to wonder how bad things need to be if a decision maker <i>doesn't</i> explore, and instead simply runs a greedy algorithm that <i>exploits</i> at every round. Maybe this is because we believe that greedy algorithms are already being run in many cases, and we want to know how much risk we run of developing pernicious feedback loops. Or maybe we want to run a greedy algorithm by design, because we are in a medical or legal setting in which exploration is unacceptable. This is the question we consider in <a href="https://arxiv.org/abs/1801.03423">a new paper</a>, joint with <a href="https://www.cis.upenn.edu/~kannan/">Sampath Kannan</a>, <a href="http://jamiemorgenstern.com/">Jamie Morgenstern</a>, <a href="http://www.bowaggoner.com/">Bo Waggoner</a>, and <a href="https://www-users.cs.umn.edu/~zsw/">Steven Wu</a>. (P.S. <a href="http://www.bowaggoner.com/">Bo</a> is on the job market right now!)</div><div><br /></div><div>Specifically, we consider the linear contextual bandit problem, described at the beginning of this post. A decision maker must choose amongst a set of actions every day, each of which is endowed with a vector of features. The reward distribution for each action is governed by an unknown <i>linear</i> function of the features. 
In this case, the greedy algorithm is simply the algorithm that maintains ordinary least squares regression estimates for each action, and always plays the action that maximizes its current point prediction of the reward, using its current regression estimate. </div><div><a href="https://arxiv.org/abs/1704.09011"><br /></a></div><div>Motivated in part by the problem with exploration in sequential drug trials, <a href="https://arxiv.org/abs/1704.09011">Bastani, Bayati, and Khosravi</a> previously studied a two-action variant of this problem, and showed that when the features are drawn stochastically from a distribution satisfying certain strong assumptions, then the greedy algorithm works well: its reward approaches that of the optimal policy, without needing exploration! We take their result as inspiration, and prove a theorem of this flavor under substantially weaker conditions (although our model is slightly different, so the results are technically incomparable). </div><div><br /></div><div>We consider a model in which the number of actions can be arbitrary, and give a <i>smoothed analysis</i>. What that means is we don't require that the actions or their features are drawn from any particular distribution. Instead, we let them be chosen by an adversary, who knows exactly how our algorithm works, and can be arbitrarily sneaky in his attempt to cause our algorithm to fail to learn. Of course, if we stopped there, then the result would be that the greedy algorithm can be made to catastrophically fail to learn with constant probability: it has long been known that exploration is necessary in the worst case. But our adversary has a slightly shaky hand. <i>After</i> he chooses the actions for a given round, the features of those actions are perturbed by a small amount of Gaussian noise, independently in each coordinate. 
What we show is that this tiny amount of noise is sufficient to cause the greedy algorithm to succeed at learning: it recovers regret bounds (regret is the difference between the cumulative reward of the greedy algorithm, and what the optimal policy would have done) that are close to optimal, up to an inverse polynomial factor of the standard deviation of the Gaussian noise. (The regret bound must grow with the inverse of the standard deviation of the noise in any smoothed analysis, since we know that if there is no noise, the greedy algorithm doesn't work...) </div><div><br /></div><div>The story is actually more nuanced: we consider two different models, derive qualitatively different bounds in each model, and prove a separation: see <a href="https://arxiv.org/abs/1801.03423">the paper</a> for details. But the punchline is that, at least in linear settings, what we can show is that "generically" (i.e. in the presence of small perturbations), fast learning is possible in bandit settings, even without exploration. This can be taken to mean that in settings in which exploration is repugnant, we don't have to do it --- and that perhaps pernicious feedback loops that can arise when we fail to explore shouldn't be expected to persist for a long time, if the features we observe are a little noisy. Of course, everything we prove is in a linear setting. We still don't know the extent to which these kinds of smoothed analyses carry over into non-linear settings. Understanding the necessity of exploration in more complex settings seems like an important problem with real policy implications. 
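To make the setup concrete, here is a minimal sketch of this greedy algorithm in the smoothed linear model. This is not the paper's code or experiments: the dimensions, the noise scales, and the use of a single shared least-squares estimate (rather than per-action estimates) are all illustrative assumptions.

```python
import numpy as np

def greedy_smoothed_bandit(theta, T=2000, k=5, d=3, perturb_sd=0.5, seed=0):
    """Greedy (exploration-free) linear contextual bandit with perturbed features.

    Each round, k candidate actions arrive with feature vectors; in the smoothed
    model, their features get independent Gaussian noise in each coordinate.
    The learner fits ordinary least squares to its observed history and greedily
    plays the action with the highest current point prediction.
    """
    rng = np.random.default_rng(seed)
    X, y = [], []                 # history: chosen features and observed rewards
    theta_hat = np.zeros(d)       # current OLS estimate of the true theta
    regret = 0.0
    for t in range(T):
        base = rng.uniform(-1, 1, size=(k, d))                 # "adversarial" features
        feats = base + rng.normal(0, perturb_sd, size=(k, d))  # smoothing perturbation
        i = int(np.argmax(feats @ theta_hat))                  # greedy choice
        reward = feats[i] @ theta + rng.normal(0, 0.1)         # noisy linear reward
        regret += np.max(feats @ theta) - feats[i] @ theta     # per-round regret
        X.append(feats[i]); y.append(reward)
        theta_hat = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
    return regret / T, theta_hat
```

Even with no explicit exploration, the coordinate-wise perturbations keep the chosen feature vectors diverse enough that the least-squares estimate converges, and the average per-round regret shrinks.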
</div><div><br /></div><div></div><div><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-77650911788845927562017-12-17T20:52:00.000-05:002017-12-17T20:53:13.990-05:00Adaptive Data Analysis Class NotesAdam Smith and I both taught a PhD seminar this semester (at BU and Penn respectively) on adaptive data analysis. We collaborated on the course notes, which can be found here: <a href="https://adaptivedataanalysis.com/lecture-schedule-and-notes/">https://adaptivedataanalysis.com/lecture-schedule-and-notes/</a><br /><div><br /></div><div>As part of their final project, students will be writing lecture notes for papers that we didn't have time to cover (listed here: <a href="https://adaptivedataanalysis.com/more-reading/">https://adaptivedataanalysis.com/more-reading/</a> ). I'll post those as they are turned in. </div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com1tag:blogger.com,1999:blog-25562705.post-20305990434940212012017-11-15T15:09:00.000-05:002017-12-18T13:13:09.997-05:00Between "statistical" and "individual" notions of fairness in Machine LearningIf you follow the fairness in machine learning literature, you might notice after a while that there are two distinct families of fairness definitions:<br /><br /><ol><li>Statistical definitions of fairness (sometimes also called <i>group</i> fairness), and</li><li>Individual notions of fairness. </li></ol><div>The statistical definitions are far and away the most popular, but the individual notions of fairness offer substantially stronger semantic guarantees. In this post, I want to dive into this distinction a little bit. 
Along the way I'll link to some of my favorite recent fairness papers, and end with a description of a recent paper from our group which tries to get some of the best properties of both worlds.</div><div><br /></div><div><b><u>Statistical Definitions of Fairness</u></b></div><div>At a high level, statistical definitions of fairness partition the world into some set of "protected groups", and then ask that some statistical property be approximately equalized across these groups. Typically, the groups are defined via some high level sensitive attribute, like race or gender. Then, statistical definitions of fairness ask that some relevant statistic about a classifier be equalized across those groups. For example, we could ask that the rate of positive classification be equal across the groups (this is sometimes called <i>statistical parity). </i>Or, we could ask that the false positive and false negative rates be equal across the groups (This is what <a href="http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf">Hardt, Price, and Srebro</a> call "Equalized Odds"). Or, we could ask that the positive predictive value (also sometimes called calibration) be equalized across the groups. These are all interesting definitions, and there are very interesting <a href="https://arxiv.org/abs/1703.00056">impossibility</a> <a href="https://arxiv.org/abs/1609.05807">results </a>that emerge if you think you might want to try and satisfy all of them simultaneously.</div><div><br /></div><div>These definitions are attractive: first of all, you can easily check whether a classifier satisfies them, since verifying that these fairness constraints hold on a particular distribution simply involves estimating some expectations, one for each protected group. 
It can be <a href="https://arxiv.org/abs/1702.06081">computationally hard</a> to find a classifier that satisfies them, while minimizing the usual convex surrogate losses --- but computational hardness pervades all of learning theory, and hasn't stopped the practice of machine learning. Already there are a number of practical algorithms for learning classifiers subject to these statistical fairness constraints --- this <a href="http://fatml.mysociety.org/media/documents/reductions_approach_to_fair_classification.pdf">recent paper</a> by Agarwal et al. is one of my favorites. </div><div><br /></div><div><i>Semantics </i></div><div>But what are the semantics of these constraints, and why do they correspond to "fairness"? You can make a different argument for each one, but here is the case for equalizing false positive and negative rates across groups (equalized odds), which is one of my favorites: </div><div><br /></div><div>You can argue (making the <i>big</i> assumption that the data is not itself already corrupted with bias --- see <a href="https://arxiv.org/abs/1609.07236">this paper</a> by Friedler et al. for a deep dive into these assumptions...) that a classifier potentially does harm to an individual when it misclassifies them. If the (say) false positive rate is higher amongst black people than white people, then this means that if you perform the thought experiment of imagining yourself as a uniformly random black person with a negative label, the classifier is more likely to harm you than if you imagine yourself as a uniformly random white person with a negative label. 
On the other hand, if the classifier equalizes false positive and negative rates across populations, then behind the "veil of ignorance", choosing whether you want to be born as a uniformly random black or white person (conditioned on having a negative label), all else being equal, you should be indifferent!</div><div><br /></div><div><i>Concerns</i></div><div>However, in this thought experiment, it is crucial that you are a uniformly random individual from the population. If you know some fact about yourself that makes you think you are not a uniformly random white person, the guarantee can lack bite. Here is a simple example:</div><div><br /></div><div>Suppose we have a population of individuals, with two features and a label. Each individual has a race (either black or white) and a gender (either male or female). The features are uniformly random and independent. Each person also has a label (positive or negative) that is also uniformly random, and independent of the features. We can define four natural protected groups: "Black people', "White people", "Men", and "Women". </div><div><br /></div><div>Now suppose we have a classifier that labels an individual as positive if and only if they are either a black man or a white woman. Across these four protected groups, this classifier appears to be fair: the false positive rate and false negative rate are all exactly 50% on all four of the protected groups, so it satisfies equalized odds. But this hides a serious problem: on two subgroups (black men and white women), the false positive rate is 100%! If being labelled positive is a bad thing, you would not want to be a random (negative) member of those sub-populations.</div><div><br /></div><div>Of course, we could remedy this problem by simply adding more explicitly protected subgroups. But this doesn't scale well: If we have d protected binary features, there are 2^(2^d) such subgroups that we can define. 
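The arithmetic behind this example is easy to check mechanically. Here is a small sketch that enumerates the eight equally likely (race, gender, label) combinations and computes false positive rates; the population and classifier are exactly the hypothetical ones above:

```python
from itertools import product

# All (race, gender, label) combinations, each equally likely in the population.
population = list(product(["black", "white"], ["male", "female"], [0, 1]))

def classify(race, gender):
    # The example classifier: positive iff black man or white woman.
    return 1 if (race, gender) in {("black", "male"), ("white", "female")} else 0

def false_positive_rate(group):
    negatives = [(r, g) for (r, g, label) in group if label == 0]
    return sum(classify(r, g) for (r, g) in negatives) / len(negatives)

# On each of the four coarse protected groups, the FPR is exactly 1/2,
# so the classifier looks fine under equalized odds...
for attribute in ["black", "white", "male", "female"]:
    assert false_positive_rate([p for p in population if attribute in p]) == 0.5

# ...but on the subgroup of black men, every negative example is misclassified.
black_men = [p for p in population if p[0] == "black" and p[1] == "male"]
print(false_positive_rate(black_men))  # 1.0
```

The computation for false negative rates is symmetric: equal at 1/2 on the four coarse groups, but 100% on the complementary subgroups (black women and white men).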
If the features are real valued, things are even worse, and we quickly get into issues of overfitting (see the next section). </div><div><br /></div><div><b><u>Individual Definitions of Fairness</u></b></div><div>At a high level, the problem with statistical definitions of fairness is that they are defined with respect to group averages, so if you are not an "average" member of your group, they don't promise you anything. Individual definitions of fairness attempt to remedy this problem by asking for a constraint that binds at the individual level. The first attempt at this was in the seminal paper <a href="https://dl.acm.org/citation.cfm?id=2090255">"Fairness Through Awareness"</a>, which suggested that fairness should mean that "similar individuals should be treated similarly". Specifically, if the algorithm designer knows the correct metric by which individual similarity should be judged, the constraint requires that individuals who are close in that metric should have similar probabilities of any classification outcome. This is a semantically very strong definition. It hasn't gained much traction in the literature yet, because of the hard problem of specifying the similarity metric. But I expect that it will be an important definition going forward --- there are just some obstacles that have to be overcome. I've been thinking about it a lot myself. </div><div><br /></div><div>Another definition of individual fairness is one that my colleagues and I at Penn introduced in <a href="https://arxiv.org/abs/1605.07139">this paper</a>, which we have taken to calling "Weakly Meritocratic Fairness". I have previously blogged about it <a href="https://aaronsadventures.blogspot.com/2016/05/fairness-in-learning.html">here</a>. We define it in a more general context, and in an online setting, but it makes sense to think about what it means in the special case of batch binary classification as well. 
In that case, it reduces to three constraints:</div><div><ol><li>For any two individuals with a negative label, the false positive rate on those individuals must be equal</li><li>For any two individuals with a positive label, the false negative rate on those individuals must be equal</li><li>For any pair of individuals A and B such that A has a negative label, and B has a positive label, the true positive rate on B can't be lower than the false positive rate on A. (Assuming that positive is the more desirable label --- otherwise this is reversed). </li></ol><div>Again, we are talking about false positive rates --- but now there is no distribution --- i.e. nothing to take an average over. Instead, the constraints bind on every pair of individuals. As a result, the "rate" we are talking about is calculated only over the randomness of the classifier itself. </div></div><div><br /></div><div>This constraint in particular implies that false positive rates are equal across any sub-population that you might define, even ex-post! Problem solved! Why don't we all just use this definition of fairness? Here is why:</div><div><br /></div><div><i>Concerns</i>:</div><div>Weakly meritocratic fairness is extremely similar (identical, except for the added condition 3) to asking for equalized odds fairness on the (potentially infinitely large) set of "protected groups" corresponding to singletons --- everyone forms their own protected group! As a result, you shouldn't hope to be able to satisfy a fairness definition like this without making some assumption on the data generating process. Even if on your sample of data, it <i>looks like</i> you are satisfying this definition of fairness, you are overfitting --- the complexity of this class of groups is too large, and your in-sample "fairness" won't generalize out of sample. If you <i>are</i> willing to make assumptions on the data generating process, then you can do it! 
In our original paper, we show how to do it if you assume the labels obey a linear relationship to the features. But assumptions of this sort generally won't hold exactly in practice, which makes this kind of approach difficult to implement in practical settings. </div><div><br /></div><div>The "Fairness Through Awareness" definition at a high level suffers from the same kind of problem: you can only use it if you make a very strong assumption (namely that you have what everyone agrees is the "right" fairness metric). And this is the crux of the problem with definitions of individual fairness: they seem to require such strong assumptions that we cannot actually realize them in practice. </div><div><br /></div><div><b><u>A Middle Ground</u></b></div><div>Asking for statistical fairness across a small number of coarsely defined groups leaves us open to unfairness on structured subgroups we didn't explicitly designate as protected (what you might call "Fairness Gerrymandering"). On the other hand, asking for statistical fairness across every possible division of the data that can be defined ex-post leaves us with an impossible statistical problem. (Consider that every imperfect classifier could be accused of being "unfair" to the subgroup that we define, ex-post, to be the set of individuals that the classifier misclassifies! But this is just overfitting...)</div><div><br /></div><div>What if we instead ask for statistical fairness across exponentially many (or infinitely many) subgroups, defined over a set of features we think should be protected, but ask that this family of subgroups itself has bounded VC-dimension? This mitigates the statistical problem --- we can now in principle train "fair" classifiers according to this definition without worrying about overfitting, assuming we have enough data (proportional to the VC-dimension of the set of groups we want to protect, and the set of classifiers we want to learn). 
And we can do this without making any assumptions about the data. It also mitigates the "Fairness Gerrymandering" problem --- at least now, we can explicitly protect an enormous number of detailed subgroups, not just coarsely defined ones. </div><div><br /></div><div>This is what we (<a href="http://www.cis.upenn.edu/~mkearns/">Michael Kearns</a>, <a href="https://sethstatistics.wordpress.com/">Seth Neel</a>, <a href="https://www-users.cs.umn.edu/~zsw/">Steven Wu</a>, and I) propose in our <a href="https://arxiv.org/abs/1711.05144v1">new paper</a>. In most of the paper, we investigate the computational challenges surrounding this kind of fairness constraint, when the statistical fairness notion we want is equality of false positive rates, false negative rates, or classification rates (statistical parity):</div><div><ul><li>First, it is no longer clear how to even <i>check</i> whether a fixed classifier satisfies a fairness constraint of this sort, without explicitly enumerating all of the protected groups (and recall, there might now be exponentially or even uncountably infinitely many such subgroups). We call this the <i>Auditing</i> problem. And indeed, in the worst case, this is a hard problem: we show that it is equivalent to weak agnostic learning, which brings with it a long list of computational hardness results from learning theory. It is hard to audit even for simple subgroup structures, definable by boolean conjunctions over features, or linear threshold functions. However, this connection also suggests an algorithmic approach to auditing. The fact that learning linear separators is NP-hard in the worst case hasn't stopped us from doing this all the time, with simple heuristics like logistic regression and SVMs, as well as more sophisticated techniques. These same heuristics can be used to solve the auditing problem. </li><li>Going further, let's suppose those heuristics work --- i.e. 
we have oracles which can optimally solve agnostic learning problems over some collection of classifiers C, and can optimally solve the auditing problem over some class of subgroups G. Then we give an algorithm that only has to maintain small state, and provably converges to the optimal distribution over classifiers in C that equalizes false positive rates (or false negative rates, or classification rates...) over the groups G. Our algorithm draws inspiration from the Agarwal et al. paper: "<a href="http://fatml.mysociety.org/media/documents/reductions_approach_to_fair_classification.pdf">A Reductions Approach to Fair Classification</a>". </li><li>And we can run these algorithms! We use a simple linear regression heuristic to implement both the agnostic learning and auditing "oracles", and run the algorithm to learn a distribution over linear threshold functions (defined over 122 attributes) that approximately equalizes false positive rates across every group definable by a linear threshold function over 18 real valued racially associated attributes, in the "Communities and Crime" dataset. It seems to work and do well! </li></ul></div><div><br /></div><div><b><u>In Conclusion...</u></b></div><div>I've long been bothered by the seemingly irreconcilable gap between individual and group notions of fairness. The individual notions of fairness have the right semantics, but they are unimplementable. The group notions of fairness promise something too weak. Going forward, I think these two schools of thought on fairness have to be reconciled. 
I think of our work as a small step in that direction.<br /><br /><div style="text-align: center;"><b>Michael recorded a short talk about this work, which you can watch on YouTube: <a href="https://www.youtube.com/watch?v=AjGQHJ0FKMg">https://www.youtube.com/watch?v=AjGQHJ0FKMg</a></b></div></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com1tag:blogger.com,1999:blog-25562705.post-82834665498865289382017-10-16T09:49:00.002-04:002017-10-16T09:52:32.284-04:00Call for Nominations for the SIGecom Doctoral Dissertation Award<span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">The SIGecom Doctoral Dissertation Award recognizes an outstanding dissertation in the field of economics and computer science. The award is conferred annually at the ACM Conference on Economics and Computation and includes a plaque, complimentary conference registration, and an honorarium of $1,500. A plaque may further be given to up to two runners-up. 
No award may be conferred if the nominations are judged not to meet the standards for the award.</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">To be eligible, a dissertation must be on a topic related to the field of economics and computer science and must have been defended successfully during the calendar year preceding the year of the award presentation.</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;">The next SIGecom Doctoral Dissertation Award will be given for dissertations defended in 2017. 
Nominations are due by the <span class="aBn" data-term="goog_1480864648" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">March 31, 2018</span></span>, and must be submitted by email with the subject "SIGecom Doctoral Dissertation Award" to<span class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434inbox-inbox-Apple-converted-space"> </span></span><span class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;">the awards committee at<span class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434inbox-inbox-Apple-converted-space"> </span><a class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" href="mailto:sigecom-awards-diss@acm.org" style="color: #1155cc;" target="_blank">sigecom-awards-diss@acm.org</a></span><span class="m_4977556430097751528gmail-m_-8094567887141969947gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;"><wbr></wbr>. 
A dissertation may be nominated simultaneously for both the SIGecom Doctoral Dissertation Award and the ACM Doctoral Dissertation Award.</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">Nominations may be made by any member of SIGecom, and will typically come from the dissertation supervisor. Self-nomination is not allowed. Nominations for the award must include the following, preferably in a single PDF file:</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">1. A two-page summary of the dissertation, written by the nominee, including bibliographic data and links to publicly accessible versions of published papers based primarily on the dissertation.</span><br /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">2. An English-language version of the dissertation.</span><br /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">3. 
An endorsement letter of no more than two pages by the nominator, arguing the merit of the dissertation, potential impact, and justification of the nomination. This document should also certify the dissertation defense date.</span><br /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">4. The names, email addresses, and affiliations of at least two additional endorsers.</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">The additional endorsement letters themselves should be emailed directly to </span><a class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" href="mailto:sigecom-awards-diss@acm.org" style="background-color: white; color: #1155cc; font-family: arial, sans-serif; font-size: 13px;" target="_blank">sigecom-awards-diss@acm.org</a><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;"><wbr></wbr>, by the same deadline. 
These endorsements should be no longer than 500 words, and should specify the relationship of the endorser to the nominee, the contributions of the dissertation, and its potential impact on the field.</span><br /><br class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: arial, sans-serif; font-size: 13px;" /><span class="m_8745097540625121615gmail-m_-8759336033224474935m_1006650683456301089gmail-m_94279031666006434gmail_msg" style="background-color: white; color: #212121; font-family: "arial" , sans-serif; font-size: 13px;">It is expected that a nominated candidate, if selected for the award, will attend the next ACM Conference on Economics and Computation to accept the award and give a presentation on the dissertation work. The cost of attending the conference is not covered by the award, but complimentary registration is provided.</span>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-3227076218129484382017-07-16T13:08:00.003-04:002017-07-16T13:08:36.727-04:00Submit your papers to WWW 2018Jennifer Wortman Vaughan and I are the track chairs for the "Web Economics, Monetization, and Online Markets" track of WWW 2018. The track name is a little unwieldy -- we tried to change it to "Economics and Markets" -- but the focus should be of interest to many in the AGT, theory, and machine learning communities. 
See the call for papers here: <a href="https://www2018.thewebconf.org/call-for-papers/research-tracks-cfp/web-economics/">https://www2018.thewebconf.org/call-for-papers/research-tracks-cfp/web-economics/</a> We have a great PC, and the topics of interest include (amongst many other things) "Economics of Privacy" and "Fairness in Economic Environments".Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-22207676040616712592016-11-25T15:22:00.002-05:002016-11-28T18:14:57.495-05:00January Fairness Workshop at PennAt the beginning of the semester, I mentioned that after the law school's <a href="https://www.law.upenn.edu/institutes/ppr/optimizing-government-project/">semester of events</a> (click for videos) on fairness, machine learning, and the law, we would host a technical workshop on recent work on fairness in machine learning.<br /><br />We have now finished putting together the program, which will be terrific. The workshop will take place here at Penn from January 19-20th. Take a look at our great line-up of speakers here: <a href="https://sites.google.com/view/fairnessconfererencepenn">https://sites.google.com/view/fairnessconfererencepenn</a><br /><br />The event is open to the public, but registration is required.Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-1276740762591491282016-10-24T10:56:00.001-04:002016-10-24T10:56:48.500-04:00Designing the Digital EconomyI'm on a train to New Haven, where I'll be giving a guest lecture (together with <a href="http://solon.barocas.org/">Solon Barocas</a>) in Glen Weyl's class, "<a href="https://www.dropbox.com/s/4l417liqgnwfihe/DDE_syllabus.docx?dl=0">Designing the Digital Economy</a>" (n.b. 
I need to get advice from Glen about how to get such good <a href="http://www.nytimes.com/2016/09/04/technology/goodbye-ivory-tower-hello-silicon-valley-candy-store.html">publicity</a> for my classes...)<br /><br />Solon and I will be sharing the 3-hour class, talking about fairness in machine learning, starting at 2:30. Pop by if you are around -- otherwise, <a href="http://www.cis.upenn.edu/~aaroth/Papers/fairtradeoffs.pptx">here</a> are my slides.Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-14134659169001439402016-09-14T09:46:00.005-04:002016-09-22T09:38:27.686-04:00Semester on Fairness and Algorithms at PennThis year, the "Fels Policy Research Initiative" is funding two exciting events, both related to fairness and machine learning. The first, joint between the law school and statistics, is called "Optimizing Government", and will host a series of 4 seminars over the course of this semester touching on technical and legal aspects of fairness.<br /><br />I will be speaking at the first one, introducing the basics of Machine Learning and scenarios in which its use can lead to inadvertent discrimination. My inimitable colleague Richard Berk (who actually builds the models used by the state of Pennsylvania to predict criminal recidivism) will be offering his comments following my talk.<br /><br />The second seminar will provide a panel discussion on what the law demands in terms of "fair and equal treatment", and how it relates to the use of machine learning. 
The panelists will come from Philosophy, Political Science, and Law.<br /><br />The third seminar will be given by our excellent Warren Center postdoc Jamie Morgenstern, and will focus on technical solutions to the problem of unfairness in machine learning, and how fairness can be squared with learning the optimal policy in online decision making settings.<br /><br />Finally, the fourth seminar will be an exciting keynote delivered by the current Deputy U.S. CTO Ed Felten on uses of machine learning in government.<br /><br />I believe the talks will be recorded.<br /><br /><br />We will begin next semester with the second Fels-sponsored workshop, organized between computer science and economics -- a 2-day intensive workshop exploring current research on technical and economic solutions for addressing unfairness in decision making. More details to come.<br /><br />Below is the schedule for this semester:<br /><span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 12.8px;"><br /></span><span style="background-color: white; color: #222222; font-family: "arial" , sans-serif; font-size: 12.8px;">The “Optimizing Government” interdisciplinary research collaboration, supported by the Fels Policy Research Initiative, will hold the following workshops this fall:</span><br /><br /><blockquote style="background-color: white; border: none; color: #222222; font-family: arial, sans-serif; font-size: 12.8px; margin: 0px 0px 0px 40px; padding: 0px;"><b><span class="aBn" data-term="goog_1970176626" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">Thursday, 9/22/16</span></span>: What is Machine Learning (and Why Might it be Unfair)?</b><br /><i>Fundamentals of machine learning with a focus on what makes it different from traditional statistical analysis and why it might lead to unfair outcomes.</i><br />Speakers: Aaron Roth (Penn 
Computer Science), with comments from Richard Berk (Wharton Statistics; Chair of SAS Criminology)<br /><div><br /></div><div><b><span class="aBn" data-term="goog_1970176627" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">Thursday, 10/6/16</span></span>: What Does Fair and Equal Treatment Demand?</b></div><div><i>Current legal and moral norms about fairness and equal protection as they relate to the use of machine learning in government.</i></div><div>Speakers: Panel featuring Samuel Freeman (Penn Philosophy), Nancy Hirschmann (Penn Political Science), and Seth Kreimer (Penn Law)</div><div><br /></div><div><b><span class="aBn" data-term="goog_1970176628" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">Thursday, 11/3/16</span></span>: Fairness and Performance Trade-Offs in Machine Learning</b></div><div><i>Technical solutions to fairness challenges raised by machine learning and their impacts on algorithm effectiveness.</i></div><div>Speaker: Jamie Morgenstern (Penn Computer Science)</div><div><br /></div><div><b><span class="aBn" data-term="goog_1970176629" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">Thursday, 11/17/16</span></span>: Keynote on Machine Learning and Government</b></div><div><i>How to use machine learning for a variety of administrative and policy functions, and findings from a White House initiative on artificial intelligence in government.</i></div><div>Speaker: Ed Felten, Deputy U.S. 
CTO (Invited)</div></blockquote><div style="background-color: white;"><div style="color: #222222; font-family: arial, sans-serif; font-size: 12.8px;"><br /></div><div style="color: #222222; font-family: arial, sans-serif; font-size: 12.8px;"><b>Each workshop will take place from <u><span class="aBn" data-term="goog_1970176630" style="border-bottom: 1px dashed rgb(204, 204, 204); position: relative; top: -2px; z-index: 0;" tabindex="0"><span class="aQJ" style="position: relative; top: 2px; z-index: -1;">4:30-6:00 pm</span></span> in Gittis 213 (Penn Law)</u>.</b> You can enter the Law School through its main entrance at 3501 Sansom Street.</div><div style="color: #222222; font-family: arial, sans-serif; font-size: 12.8px;"><br /></div><b style="color: #222222; font-family: arial, sans-serif; font-size: 12.8px;">EDIT: The Optimizing Government Project now has a website: </b><span style="color: #222222; font-family: arial, sans-serif;"><span style="font-size: 12.8px;"><b><a href="https://www.law.upenn.edu/institutes/ppr/optimizing-government-project/">https://www.law.upenn.edu/institutes/ppr/optimizing-government-project/ </a>and the talks will be livestreamed here:<a href="https://www.law.upenn.edu/institutes/ppr/optimizing-government-project/media.php"> https://www.law.upenn.edu/institutes/ppr/optimizing-government-project/media.php</a></b></span></span></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-74237352911181202432016-08-24T09:07:00.002-04:002016-08-24T09:07:48.711-04:00Call for Papers: Second Workshop on Adaptive Data AnalysisAs part of NIPS 2016, we will be running the second annual workshop on adaptive data analysis. Last year's workshop was a big hit. As a new addition this year, we are soliciting submitted contributions in addition to invited speakers. The call for papers is below. If you are doing relevant work, definitely submit it to our workshop! 
More information at: <a href="http://wadapt.org/">http://wadapt.org/</a><div><br /></div><br />Call for Papers<br /><br /><br />The overall goal of WADAPT is to stimulate discussion on the theoretical analysis and practical aspects of adaptive data analysis. We seek contributions from different research areas of machine learning, statistics, and computer science. Submissions focused on a particular area of application are also welcome.<br /><br />Submissions will undergo a lightweight review process and will be judged on originality, relevance, clarity, and the extent to which their presentation can stimulate discussion between different communities at the workshop. Submissions may describe either novel work (completed or in progress), or work already published or submitted elsewhere, provided that it first appeared after September 1, 2015.<br /><br />Authors are invited to submit either a short abstract (2-4 pages) or a complete paper by Oct 15, 2016. Information about previous publication, if applicable, should appear prominently on the first page of the submission. Abstracts must be written in English and be submitted as a single PDF file at <a href="https://easychair.org/conferences/?conf=wadapt2016">EasyChair</a>.<br /><br />All accepted abstracts will be presented at the workshop as posters, and some will be selected for an oral presentation. The workshop will not have formal proceedings, and presentation at the workshop is not intended to preclude later publication at another venue.<br /><br />Those who need to receive a notification before the NIPS early registration deadline (Oct 6, 2016) should submit their work by the early submission deadline of Sept 23, 2016.<br /><br />Important Dates:<br /><br /><br />Submission deadlines. Early: Sep 23, 2016; Regular: Oct 25, 2016. Submit at <a href="https://easychair.org/conferences/?conf=wadapt2016">EasyChair</a>.<br />Notification of acceptance. 
Early: Oct 3, 2016; Regular: Nov 7, 2016.<br />Workshop: December 9, 2016<br /><br /><br /><br />Specific topics of interest for the workshop include (but are not limited to):<br /><br /><br />Selective/post-selection inference<br />Sequential/online false discovery rate control<br />Algorithms for answering adaptively chosen data queries<br />Computational and statistical barriers to adaptive data analysis<br />Stability measures and their applications to generalization<br />Information-theoretic approaches to generalizationAaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0tag:blogger.com,1999:blog-25562705.post-70397341264537680572016-06-04T17:18:00.002-04:002016-06-04T17:18:33.352-04:00Machine Learning PostdocMy brand new colleague Shivani Agarwal is in the market for a postdoc; the announcement is below. One of the targeted areas is machine learning and economics. Whoever takes this position will join a growing group of exceptional postdocs in this area at Penn, including Jamie Morgenstern and Bo Waggoner.<br /><br /><br /><br /><pre style="white-space: pre-wrap; word-wrap: break-word;">Postdoctoral Position in Machine Learning at UPenn<br /><br />Applications are invited for a postdoctoral position in machine learning in the Department of Computer and Information Science at the University of Pennsylvania. The position is expected to begin in Fall 2016, and is for a period of up to two years (with renewal in the second year contingent on performance in the first year). Applications in all areas of machine learning will be considered, with special emphasis on the following areas: ranking and choice modeling; connections between machine learning and economics; and learning of complex structures.<br /><br />The ideal candidate will demonstrate both ability for independent thinking and interest in co-mentoring of graduate students. 
The candidate will work primarily with Shivani Agarwal (joining UPenn faculty in July 2016), but will also have opportunities to collaborate with other faculty in machine learning and related areas at UPenn, including Michael Kearns, Daniel Lee, Sasha Rakhlin, Aaron Roth, and Lyle Ungar.<br /><br />UPenn is located in the vibrant city of Philadelphia, which is known for its rich culture, history, museums, parks, and restaurants. It is less than 1.5 hrs by train to NYC, 1.5 hrs by flight to Boston, 2 hrs by train to Washington DC, and 40 mins by train to Princeton. For more details about the CIS department at UPenn, see:<br /><br />http://www.cis.upenn.edu/<br /><br />To apply, send the following materials in an email titled “Application for Postdoctoral Position” to ashivani@seas.upenn.edu by June 17, 2016:<br /><br />- curriculum vitae<br />- 2-page statement of research interests and goals<br />- 3 representative publications or working papers<br />- 3 letters of recommendation (to be sent separately by the same date)<br /><br />Shortlisted candidates will be invited for a short meeting/interview at ICML/COLT in NYC during June 23-26 (in your email, please indicate your availability for this).<br /></pre><div><br /></div>Aaronhttp://www.blogger.com/profile/09952936358739421126noreply@blogger.com0