Monday, March 12, 2018

Call for nominations for the SIGecom Dissertation Award

Dear all,

Please consider nominating recently graduated Ph.D. students working in algorithmic game theory/mechanism design/market design for the SIGecom Dissertation Award.  If you are a graduating student, consider asking your adviser or other senior mentor to nominate you.

Nominations are due at the end of this month, March 31, 2018.  This award is given to a student who defended a thesis in 2017.  It is a prestigious award and is accompanied by a $1500 prize.  In the past, the grand prize has been awarded to:

2016: Peng Shi, "Prediction and Optimization in School Choice"
2015: Inbal Talgam-Cohen, "Robust Market Design: Information and Computation"
2014: S. Matthew Weinberg, "Algorithms for Strategic Agents"
2013: Balasubramanian Sivan, "Prior Robust Optimization"
and the award has had five runners-up: Bo Waggoner, James Wright, Xi (Alice) Gao, Yang Cai, and Sigal Oren.  You can find detailed information about the nomination process at: http://www.sigecom.org/awardd.html. We look forward to reading your nominations!
Your Award Committee,
Nicole Immorlica
Ariel Procaccia
Aaron Roth

Sunday, March 04, 2018

How (un)likely is an "intelligence explosion"?


I've been having fun recently reading about "AI Risk". There is lots of eloquent writing out there about this topic: I especially recommend Scott Alexander's Superintelligence FAQ for those looking for a fun read. The subject has reached the public consciousness, with high profile people like Stephen Hawking and Elon Musk speaking publicly about it. There is also an increasing amount of funding and research effort being devoted to understanding AI risk. See for example the Future of Humanity Institute at Oxford, the Future of Life Institute at MIT, and the Machine Intelligence Research Institute in Berkeley, among others. These groups seem to be doing lots of interesting research, which I am mostly ignorant of. In this post I just want to talk about a simple exercise in asymptotics.

First, Some Background.

A "superintelligent" AI is loosely defined to be an entity that is much better than we are at essentially any cognitive/learning/planning task. Perhaps, by analogy, a superintelligent AI is to human beings as human beings are to Bengal tigers, in terms of general intelligence. It shouldn't be hard to convince yourself that if we were in the company of a superintelligence, then we would be very right to be worried: after all, it is intelligence that allows human beings to totally dominate the world and drive Bengal tigers to near extinction, despite the fact that tigers physiologically dominate humans in most other respects. This is the case even if the superintelligence doesn't have the destruction of humanity as a goal per-se (after all, we don't have it out for tigers), and even if the superintelligence is just an unconscious but super-powerful optimization algorithm. I won't rehash the arguments here (Scott does it better) but it essentially boils down to the fact that it is quite hard to anticipate what the results of optimizing an objective function will be, if the optimization is done over a sufficiently rich space of strategies. And if we get it wrong, and the optimization has some severely unpleasant side-effects? It is tempting to suggest that at that point, we just unplug the computer and start over. The problem is that if we unplug the intelligence, it won't do as well at optimizing its objective function compared to  if it took steps to prevent us from unplugging it. So if it's strategy space is rich enough so that it is able to take steps to defend itself, it will. Lots of the most interesting research in this field seems to be about how to align optimization objectives with our own desires, or simply how to write down objective functions that don't induce the optimization algorithm to try and prevent us from unplugging it, while also not incentivizing the algorithm to unplug itself (the corrigibility problem).

Ok. It seems uncontroversial that a hypothetical superintelligence would be something we should take very seriously as a danger. But isn't it premature to worry about this, given how far off it seems to be? We aren't even that good at making product recommendations, let alone optimization algorithms so powerful that they might inadvertently destroy all of humanity. Even if superintelligence will ultimately be something to take very seriously, are we even in a position to productively think about it now, given how little we know about how such a thing might work at a technical level? This seems to be the position that Andrew Ng was taking, in his much quoted statement that (paraphrasing) worrying about the dangers of super-intelligence right now is like worrying about overpopulation on Mars. Not that it might not eventually be a serious concern, but that we will get a higher return investing our intellectual efforts right now on more immediate problems.

The standard counter to this is that super-intelligence might always seem like it is well beyond our current capabilities -- maybe centuries in the future -- until, all of a sudden, it appears as the result of an uncontrollable chain reaction known as an "intelligence explosion", or "singularity". (As far as I can tell, very few people actually think that intelligence growth would exhibit an actual mathematical singularity --- this seems instead to be a metaphor for exponential growth.) If this is what we expect, then now might very well be the time to worry about super-intelligence. The first argument of this form was put forth by British mathematician I.J. Good (of Good-Turing Frequency Estimation!):
“Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.”
Scott Alexander summarizes the same argument a bit more quantitatively. In this passage, he is imagining the starting point being a full-brain simulation of Einstein --- except run on faster hardware, so that our simulated Einstein operates at a much faster clock-speed than his historical namesake:
It might, like the historical Einstein, contemplate physics. Or it might contemplate an area very relevant to its own interests: artificial intelligence. In that case, instead of making a revolutionary physics breakthrough every few hours, it will make a revolutionary AI breakthrough every few hours. Each AI breakthrough it makes, it will have the opportunity to reprogram itself to take advantage of its discovery, becoming more intelligent, thus speeding up its breakthroughs further. The cycle will stop only when it reaches some physical limit – some technical challenge to further improvements that even an entity far smarter than Einstein cannot discover a way around. 
To human programmers, such a cycle would look like a “critical mass”. Before the critical level, any AI advance delivers only modest benefits. But any tiny improvement that pushes an AI above the critical level would result in a feedback loop of inexorable self-improvement all the way up to some stratospheric limit of possible computing power. 
This feedback loop would be exponential; relatively slow in the beginning, but blindingly fast as it approaches an asymptote. Consider the AI which starts off making forty breakthroughs per year – one every nine days. Now suppose it gains on average a 10% speed improvement with each breakthrough. It starts on January 1. Its first breakthrough comes January 10 or so. Its second comes a little faster, January 18. Its third is a little faster still, January 25. By the beginning of February, it’s sped up to producing one breakthrough every seven days, more or less. By the beginning of March, it’s making about one breakthrough every three days or so. But by March 20, it’s up to one breakthrough a day. By late on the night of March 29, it’s making a breakthrough every second.
As far as I can tell, this possibility of an exponentially-paced intelligence explosion is the main argument for folks devoting time to worrying about super-intelligent AI now, even though current technology doesn't give us anything even close. So in the rest of this post, I want to push a little bit on the claim that the feedback loop induced by a self-improving AI would lead to exponential growth, and see what assumptions underlie it.

A Toy Model for Rates of Self Improvement

Let's write down an extremely simple toy model for how quickly the intelligence of a self-improving system would grow, as a function of time. And I want to emphasize that the model I will propose is clearly a toy: it abstracts away everything that is interesting about the problem of designing an AI. But it should be sufficient to focus on a simple question of asymptotics, and the degree to which growth rates depend on the extent to which AI research exhibits diminishing marginal returns on investment. In the model, AI research accumulates with time: at time t, R(t) units of AI research have been conducted. Perhaps think of this as a quantification of the number of AI "breakthroughs" that have been made in Scott Alexander's telling of the intelligence explosion argument. The intelligence of the system at time t, denoted I(t), will be some function of the accumulated research R(t). The model will make two assumptions:

  1. The rate at which research is conducted is directly proportional to the current intelligence of the system. We can think about this either as a discrete dynamics, or as a differential equation. In the discrete case, we have: $R(t+1) = R(t) + I(t)$, and in the continuous case: $\frac{dR}{dt} = I(t)$. 
  2. The relationship between the current intelligence of the system and the currently accumulated quantity of research is governed by some function f: $I(t) = f(R(t))$.
The function f can be thought of as capturing the marginal rate of return of additional research on the actual intelligence of an AI. For example, if we think AI research is something like pumping water from a well --- a task for which doubling the work doubles the return --- then, we would model f as linear: $f(x) = x$. In this case, AI research does not exhibit any diminishing marginal returns: a unit of research gives us just as much benefit in terms of increased intelligence, no matter how much we already understand about intelligence. On the other hand, if we think that AI research should exhibit diminishing marginal returns --- as many creative endeavors seem to --- then we would model f as an increasing concave function. For example, we might let $f(x) = \sqrt{x}$, or $f(x) = x^{2/3}$, or $f(x) = x^{1/3}$, etc. If we are really pessimistic about the difficulty of AI, we might even model $f(x) = \log(x)$.  In these cases, intelligence is still increasing in research effort, but the rate of increase as a function of research effort is diminishing, as we understand more and more about AI. Note however that the rate at which research is being conducted is increasing, which might still lead us to exponential growth in intelligence, if it increases fast enough.
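To make the two assumptions concrete, here is a minimal Python sketch of the discrete dynamics. The starting value $R(0) = 1$, the horizon, and the function names are illustrative choices of mine rather than part of the model:

```python
def simulate(f, steps, r0=1.0):
    """Toy model: I(t) = f(R(t)) and R(t+1) = R(t) + I(t)."""
    R = r0
    history = []
    for _ in range(steps + 1):
        I = f(R)          # assumption 2: intelligence is a function of accumulated research
        history.append(I)
        R = R + I         # assumption 1: research accrues at a rate proportional to intelligence
    return history

# No diminishing returns, f(x) = x: intelligence doubles every step.
print(simulate(lambda x: x, 5))         # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
# Diminishing returns, f(x) = sqrt(x): growth is far slower.
print(simulate(lambda x: x ** 0.5, 5))  # roughly [1.0, 1.41, 1.85, 2.29, 2.75, 3.21]
```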

So how does our choice of f affect intelligence growth rates? First, let's consider the case in which $f(x) = x$ --- the case of no diminishing marginal returns on research investment. Here is a plot of the growth over 1000 time steps in the discrete model:
Here, we see exponential growth in intelligence. (It isn't hard to directly work out that in this case, in the discrete model, we have $I(t) = 2^t$, and in the continuous model, we have $I(t) = e^t$.) And the plot illustrates the argument for worrying about AI risk now. Viewed at this scale, progress in AI appears to plod along at unimpressive levels before suddenly shooting up to an unimaginable level: in this case, a quantity that, written out in decimal, would have more than 300 digits.
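For completeness, here is the short calculation behind that parenthetical, taking $R(0) = 1$ (a normalization the post leaves implicit). With $f(x) = x$ we have $I(t) = R(t)$, so
\[ R(t+1) = R(t) + I(t) = 2R(t) \Rightarrow R(t) = 2^t, \qquad \frac{dR}{dt} = R(t) \Rightarrow R(t) = e^t, \]
which gives $I(t) = 2^t$ in the discrete model and $I(t) = e^t$ in the continuous one.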

It isn't surprising that if we were to model severely diminishing returns --- say $f(x) = \log(x)$ --- this would not occur. Below, we plot what happens when $f(x) = \log(x)$, with time taken out all the way to 1,000,000 rather than merely 1000 as in the above plot:
Intelligence growth is not very impressive here. At time 1,000,000 we haven't even reached 17. If you wanted to reach (say) an intelligence level of 30, you'd have to wait an unimaginably long time. In this case, we definitely don't need to worry about an "intelligence explosion", and probably not even about ever reaching anything that could be called a super-intelligence.

But what about moderate (polynomial) levels of diminishing marginal returns? What if we take $f(x) = x^{1/3}$? Let's see:
Ok --- now we are making more progress, but even though intelligence now has a polynomial relationship to research (and research speed is increasing, in a chain reaction!), the rate of growth in intelligence is still decreasing. What about if $f(x) = \sqrt{x}$? Let's see:
At least now the rate of growth doesn't seem to be decreasing: but it is growing only linearly with time. Hardly an explosion. Maybe we just need to get more aggressive in our modeling. What if $f(x) = x^{2/3}$? 
Ok, now we've got something! At least now the rate of intelligence gains is increasing with time. But it is increasing more slowly than a quadratic function -- a far cry from the exponential growth that characterizes an intelligence explosion. 
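For readers who want to reproduce plots of this kind, here is a rough matplotlib sketch. The starting value $R(0) = 2$ and the horizons are my choices (a starting value above 1 just keeps the $\log$ curve from getting stuck at $f(1) = 0$), and the constants behind the original plots may differ:

```python
import numpy as np
import matplotlib.pyplot as plt

def trajectory(f, steps, r0=2.0):
    """Discrete dynamics R(t+1) = R(t) + f(R(t)); returns the intelligence curve I(t) = f(R(t))."""
    R, out = r0, []
    for _ in range(steps + 1):
        I = f(R)
        out.append(I)
        R += I
    return out

curves = [
    (r"$f(x)=\log x$",   np.log,               1_000_000),
    (r"$f(x)=x^{1/3}$",  lambda x: x ** (1/3), 1000),
    (r"$f(x)=\sqrt{x}$", np.sqrt,              1000),
    (r"$f(x)=x^{2/3}$",  lambda x: x ** (2/3), 1000),
]

fig, axes = plt.subplots(1, len(curves), figsize=(16, 3))
for ax, (label, f, steps) in zip(axes, curves):
    ax.plot(trajectory(f, steps))
    ax.set_title(label)
    ax.set_xlabel("t")
    ax.set_ylabel("I(t)")
plt.tight_layout()
plt.show()
```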

Let's take a break from all of this plotting. The model we wrote down is simple enough that we can just go and solve the differential equation. Suppose we have $f(x) = x^{1-\epsilon}$ for some $\epsilon > 0$. Then the differential equation solves to give us:
\[I(t) = \left(\left(1+\epsilon t \right)^{1/\epsilon} \right)^{1-\epsilon} \]
What this means is that for any positive value of $\epsilon$, in this model, intelligence grows at only a polynomial rate. The only way this model gives us exponential growth is if we take $\epsilon \rightarrow 0$, and insist that $f(x) = x$ --- i.e. that the intelligence design problem does not exhibit any diminishing marginal returns at all. 
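For readers who want to check the algebra, the continuous-time equation separates cleanly. With $\frac{dR}{dt} = f(R) = R^{1-\epsilon}$ and $R(0) = 1$,
\[ R^{\epsilon - 1}\, dR = dt \;\Rightarrow\; \frac{R(t)^{\epsilon} - 1}{\epsilon} = t \;\Rightarrow\; R(t) = \left(1 + \epsilon t\right)^{1/\epsilon}, \]
and substituting back into $I(t) = f(R(t)) = R(t)^{1-\epsilon}$ gives the expression above: polynomial growth of degree $(1-\epsilon)/\epsilon$ in $t$. For $\epsilon = 1/2$ this is linear growth, matching the $f(x) = \sqrt{x}$ plot.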

Thoughts
So what do we learn from this exercise? Of course, one can quibble with the details of the model, and one can believe different things about which form of the function f best approximates reality. But for me, this model helps crystallize the extent to which the "exponential intelligence explosion" story crucially relies on intelligence design being one of those rare tasks that doesn't exhibit any decreasing marginal returns on effort at all. This seems unlikely to me, and counter to experience.

Of course, there are technological processes out there that do appear to exhibit exponential growth, at least for a little while. Moore's law is the most salient example. But it is important to remember that even exponential growth for a little while need not seem explosive at human time scales. Doubling every day corresponds to exponential growth, but so does increasing by 1% a year. To paraphrase Ed Felten: our retirement plans extend beyond depositing a few dollars into a savings account, and waiting for the inevitable "wealth explosion" that will make us unimaginably rich. 


Postscript
I don't claim that anything in this post is either novel or surprising to folks who spend their time thinking about this sort of thing. There is at least one paper that writes down a model including diminishing marginal returns, which yields a linear rate of intelligence growth.

It is also interesting to note that in the model we wrote down, exponential growth is really a knife edge phenomenon. We already observed that we get exponential growth if $f(x) = x$, but not if $f(x) = x^{1-\epsilon}$ for any $\epsilon > 0$. But what if we have $f(x) = x^{1+\epsilon}$ for $\epsilon > 0$? In that case, we don't get exponential growth either! As Hadi Elzayn pointed out to me, Osgood's test tells us that in this case, the function $I(t)$ contains an actual mathematical singularity --- it approaches an infinite value in finite time.
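The same separation-of-variables calculation shows where the singularity comes from. With $\frac{dR}{dt} = R^{1+\epsilon}$ and $R(0) = 1$,
\[ R^{-1-\epsilon}\, dR = dt \;\Rightarrow\; \frac{1 - R(t)^{-\epsilon}}{\epsilon} = t \;\Rightarrow\; R(t) = \left(1 - \epsilon t\right)^{-1/\epsilon}, \]
which diverges as $t \rightarrow 1/\epsilon$: the solution (and hence $I(t) = R(t)^{1+\epsilon}$) reaches an infinite value at the finite time $t = 1/\epsilon$.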

Wednesday, January 10, 2018

Fairness and The Problem with Exploration: A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem

Bandit Problems

"Bandit problems" are a common abstraction in machine learning. The name is supposed to evoke the image of slot machines, which are also known as "One-armed bandits" (or so I am told... Somehow nobody speaks like this in the circles I run in.) In the classic formulation, you imagine yourself standing in front of a bank of slot machines, each of which is different and might have a different distribution on payoffs. You can sequentially decide which machine's arm to pull: when you pull the arm, you get a sample from that machine's reward distribution. Your goal is to pull the arms in such a way so that your average reward approaches what it would have been had you played the optimal strategy in hindsight: i.e. always pulled the arm of the machine that had the highest mean reward. The problem is tricky because you don't know the reward distributions to start. The key feature of the problem is that in order to observe a sample from the reward distribution of a particular machine, you actually have to pull that arm and spend a turn to do it.

There are more complicated variants of this kind of problem, in which the choices available to you may vary from round to round, and are distinguished by their features. For example, in the linear contextual bandit problem, in every round, you are presented with a number of choices, each of which is represented by some real-valued vector, which may be different from round to round. You then get to choose one of them. The reward you get, corresponding to the action you choose, is random --- but its expectation is equal to some unknown linear function of the action's features. The key property of the problem is that you only get to observe the reward of the action you choose. You do not get to observe the reward you would have obtained had you chosen one of the other actions --- this is what makes it a bandit problem. The goal again is to choose actions such that your average reward approaches what it would have been had you played the optimal strategy in hindsight. In this case, with full knowledge of the underlying linear functions, the optimal strategy would simply compute the expected reward of each action, and play the action with the highest expected reward (this may now be a different action at each round, since the features change). Again, the problem is tricky because you don't know the linear functions which map features to expected reward.
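In code, one round of this interaction might look like the following sketch; the dimension, the number of actions, and the Gaussian reward noise are illustrative choices of mine, not part of the problem definition:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 10                      # feature dimension and number of actions per round (illustrative)
theta = rng.normal(size=d)        # the unknown linear map from features to expected reward

def play_one_round(choose):
    """One round of a linear contextual bandit: the learner sees k feature vectors,
    picks one, and observes only the chosen action's (noisy) reward."""
    contexts = rng.normal(size=(k, d))           # this round's available actions
    i = choose(contexts)                         # the learner's decision rule
    reward = contexts[i] @ theta + rng.normal()  # expected reward is linear in the chosen features
    return contexts, i, reward                   # the other actions' rewards are never revealed

# Example: a learner that just picks uniformly at random.
contexts, i, reward = play_one_round(lambda C: rng.integers(len(C)))
```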

Bandit problems are characterized by the tension between exploration and exploitation. At any given moment, the algorithm has some guess as to what the best action is. (In a linear contextual bandit problem, for example, the algorithm might just run least-squares regression on the examples it has observed rewards for: the point predictions of the regression estimate on the new actions are these best guesses). In order to be able to compete with the optimal policy, the algorithm definitely needs to (eventually) play the action it thinks is the best one, a lot of the time. This is called exploitation: the algorithm is exploiting its knowledge in order to do well. However, since the algorithm does not observe the rewards for actions it does not play, it also will generally need to explore. Consider someone in front of those slot machines: if he has played each machine once, he has some weak beliefs about the payoff distribution for each machine. But those beliefs have resulted from just a single sample, so they may well be very wrong. If he just forever continues to play the machine that has the highest empirical reward, he might play the same machine every day, and he will never learn more about the other machines  --- even though one of them might have turned out to really have higher average reward. He has fooled himself, and because he never explores, he continues to be wrong forever. So optimal algorithms have to carefully balance exploration and exploitation in order to do well. 

There are various clever schemes for trading off exploration and exploitation, but the simplest one is called epsilon-Greedy. At every round, the algorithm flips a coin with bias epsilon. If the coin comes up heads (with probability 1-epsilon), the algorithm exploits: it just greedily chooses the action that it estimates to be best. If the coin comes up tails (with probability epsilon), the algorithm explores: it chooses an action uniformly at random. If you set epsilon appropriately, in many settings, you have the guarantee that the average reward of the algorithm will approach the average reward of the optimal policy, as time goes on. Fancier algorithms for bandit problems more smoothly interpolate between exploration and exploitation: but all of them inevitably have to manage this tradeoff. 
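Here is a minimal sketch of epsilon-Greedy in the slot-machine setting; the arm means, the value of epsilon, and the warm start (pulling each arm once so the empirical means are defined) are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.7])   # unknown to the algorithm (illustrative values)
epsilon, T = 0.1, 10_000

counts = np.zeros(len(true_means))       # number of pulls per arm
totals = np.zeros(len(true_means))       # summed observed reward per arm

# Warm start: pull each arm once so that every empirical mean is defined.
for arm in range(len(true_means)):
    counts[arm] += 1
    totals[arm] += rng.normal(true_means[arm], 1.0)

for t in range(T):
    if rng.random() < epsilon:
        arm = int(rng.integers(len(true_means)))     # explore: a uniformly random arm
    else:
        arm = int(np.argmax(totals / counts))        # exploit: the best empirical mean so far
    counts[arm] += 1
    totals[arm] += rng.normal(true_means[arm], 1.0)  # sample that machine's reward distribution

print("empirical means:", totals / counts)
print("average reward: %.3f vs. best arm: %.3f" % (totals.sum() / counts.sum(), true_means.max()))
```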

Fairness Concerns

(Contextual) bandit problems are not just an interesting mathematical curiosity, but actually arise all the time in important applications. Here are just a few:
  • Lending: Banks (and increasingly algorithms) consider applicants for loans. They can look at applicant features, and observe the repayment history of applicants they grant loans to, but they cannot observe counterfactuals: would an applicant not given a loan have defaulted if he had been given the loan? 
  • Parole Decisions: Judges (and increasingly algorithms) consider inmates for parole. In part, they want to predict which inmates will go on to re-offend if released, and not release those inmates. But it is not possible to observe the counterfactual: would an inmate who was not released have gone on to commit a crime if he had been released? 
  • Predictive Policing: Police chiefs (and increasingly algorithms) consider where in their city they want to deploy their police. In part, they want to predict where crimes will occur, so they can deploy a heavier police presence in those areas. But they also disproportionately observe crimes in the areas in which police are deployed.
  • Sequential Clinical Trials: Certain kinds of drugs affect patients differently, depending on their genetic markers. But for a new drug, the relationship might be unknown. As patients show up, they can be assigned to different clinical trials, corresponding to different drugs --- but we cannot observe the counterfactual: how would a patient have reacted to a drug she was not given? 
I picked these four examples (rather than the more commonly used example of ad targeting) because in each of the above examples, the decision made by the algorithm has an important effect on someone's life. This raises issues of fairness, and as we will see, these fairness questions relate to exploration in at least two different ways. 

First, there is an issue that was recently raised by Bird, Barocas, Crawford, Diaz, and Wallach: that exploring can be unfair to the individuals who have the misfortune of being present during the exploration rounds. Consider the example of sequential clinical trials: if a patient shows up on an "exploitation" round, then she is given the treatment that is believed to be the best for her, using a best effort estimate given the information available. But if she shows up during an "exploration" round, she will be given a random treatment --- generally, one that current evidence suggests is not the best for her. Is this fair to her? What if her life depends on the treatment? Exploration is explicitly sacrificing her well-being, for the possible good of future individuals, who might be able to benefit from what we learned from the exploration. It may be that when we are dealing with important decisions about people's lives, we don't want to sacrifice the welfare of an individual for the good of others. Certainly, our notional patient would prefer not to show up on an exploration round. In various other settings, we can also imagine exploration being repugnant, even though it is in principle necessary for learning. Can we (for example) randomly release inmates on parole, even if we believe them to be high risks for committing more violent crimes?

There is another, conflicting concern: if we do not explore, we might not correctly learn about the decisions we have to make. And there are a number of reasons we might expect to see insufficient exploration. Exploration is necessary to maximize long-term reward, but decision makers might be myopic. For example, the loan officers actually making lending decisions might be more interested in maximizing their annual bonuses than maximizing the long-term profits of the bank. Even the CEO might be more interested in quarterly share prices. But myopic decision makers won't explore (tangent: We had a paper at EC thinking about how one might incentivize myopic decision makers to nevertheless be meritocratically fair). The algorithms in use in many settings might also not have been designed properly --- if our learning algorithms don't explicitly take into account the bandit nature of the problem, and just treat it as a supervised classification problem, then they won't explore --- and this kind of mistake is probably extremely common. And finally, as we argued above, we might believe that it is simply not "fair" to explore, and so we intentionally avoid it. But a lack of exploration (and the resulting failure to properly learn) can itself be a source of unfairness. The "feedback loops" that result from a lack of exploration have been blamed by Lum and Isaac and by Ensign, Friedler, Neville, Scheidegger, and Venkatasubramanian as a primary source of unfairness in predictive policing. (See Suresh's recent blog post). Such algorithms can over-estimate crime in the poor neighborhoods in which police are deployed, and underestimate crime in the rich neighborhoods. If they don't explore to correct these mis-estimates, they will deploy more police to the poor neighborhoods, and fewer to the rich neighborhoods, which only exacerbates the data collection problem.  

Our New Paper

These two concerns both lead us to wonder how bad things need to be if a decision maker doesn't explore, and instead simply runs a greedy algorithm that exploits at every round. Maybe this is because we believe that greedy algorithms are already being run in many cases, and we want to know how much risk we run of developing pernicious feedback loops. Or maybe we want to run a greedy algorithm by design, because we are in a medical or legal setting in which exploration is unacceptable. This is the question we consider in a new paper, joint with Sampath Kannan, Jamie Morgenstern, Bo Waggoner, and Steven Wu. (P.S. Bo is on the job market right now!)

Specifically, we consider the linear contextual bandit problem, described at the beginning of this post. A decision maker must choose amongst a set of actions every day, each of which is endowed with a vector of features. The reward distribution for each action is governed by an unknown linear function of the features. In this case, the greedy algorithm is simply the algorithm that maintains ordinary least squares regression estimates for each action, and always plays the action that maximizes its current point prediction of the reward, using its current regression estimate. 
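Here is a rough sketch of that greedy rule. The implementation details are mine, not the paper's: each action gets its own least-squares estimate, refit from scratch each round, and actions with no data yet default to a prediction of zero (whether actions have separate parameter vectors or share one is one of the modeling choices alluded to below).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, T = 5, 10, 2000                   # dimension, actions, rounds (illustrative)
theta = rng.normal(size=(k, d))         # each action's unknown linear reward function

X = [[] for _ in range(k)]              # observed feature vectors, per action
y = [[] for _ in range(k)]              # observed rewards, per action
total, best = 0.0, 0.0

for t in range(T):
    contexts = rng.normal(size=(k, d))  # this round's feature vectors

    # Greedy: predict each action's reward with its current least-squares fit and exploit.
    preds = np.zeros(k)
    for a in range(k):
        if y[a]:
            theta_hat, *_ = np.linalg.lstsq(np.array(X[a]), np.array(y[a]), rcond=None)
            preds[a] = contexts[a] @ theta_hat
    a = int(np.argmax(preds))           # never explores

    reward = contexts[a] @ theta[a] + rng.normal()
    X[a].append(contexts[a]); y[a].append(reward)

    total += contexts[a] @ theta[a]
    best += max(contexts[b] @ theta[b] for b in range(k))

print("per-round regret of greedy:", (best - total) / T)
```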

Motivated in part by the problem with exploration in sequential drug trials, Bastani, Bayati, and Khosravi previously studied a two-action variant of this problem, and showed that when the features are drawn stochastically from a distribution satisfying certain strong assumptions, then the greedy algorithm works well: its reward approaches that of the optimal policy, without needing exploration! We take their result as inspiration, and prove a theorem of this flavor under substantially weaker conditions (although our model is slightly different, so the results are technically incomparable). 

We consider a model in which the number of actions can be arbitrary, and give a smoothed analysis. What that means is that we don't require the actions or their features to be drawn from any particular distribution. Instead, we let them be chosen by an adversary, who knows exactly how our algorithm works, and can be arbitrarily sneaky in his attempt to cause our algorithm to fail to learn. Of course, if we stopped there, then the result would be that the greedy algorithm can be made to catastrophically fail to learn with constant probability: it has long been known that exploration is necessary in the worst case. But our adversary has a slightly shaky hand. After he chooses the actions for a given round, the features of those actions are perturbed by a small amount of Gaussian noise, independently in each coordinate. What we show is that this tiny amount of noise is sufficient to cause the greedy algorithm to succeed at learning: it recovers regret bounds (regret is the difference between the cumulative reward of the greedy algorithm and what the optimal policy would have achieved) that are close to optimal, up to factors polynomial in the inverse of the standard deviation of the Gaussian noise. (The regret bound must grow with the inverse of the standard deviation of the noise in any smoothed analysis, since we know that if there is no noise, the greedy algorithm doesn't work...)
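The perturbation step itself is easy to state in code; the sketch below shows how one round's features would be generated in such a smoothed model. The stand-in adversary and the value of sigma are illustrative, and in the analysis the adversary may depend arbitrarily on the algorithm's history:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, sigma = 5, 10, 0.1          # dimension, actions, perturbation scale (illustrative)

def smoothed_contexts(adversarial_base, sigma):
    """The adversary proposes this round's (k, d) feature matrix; nature then adds
    independent Gaussian noise to every coordinate before the learner sees it."""
    return adversarial_base + rng.normal(scale=sigma, size=adversarial_base.shape)

# Stand-in adversary: in the model it can be chosen adaptively and adversarially.
base = rng.uniform(-1.0, 1.0, size=(k, d))
contexts = smoothed_contexts(base, sigma)
```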

The story is actually more nuanced: we consider two different models, derive qualitatively different bounds in each model, and prove a separation: see the paper for details. But the punchline is that, at least in linear settings, what we can show is that "generically" (i.e. in the presence of small perturbations), fast learning is possible in bandit settings, even without exploration. This can be taken to mean that in settings in which exploration is repugnant, we don't have to do it --- and that perhaps pernicious feedback loops that can arise when we fail to explore shouldn't be expected to persist for a long time, if the features we observe are a little noisy. Of course, everything we prove is in a linear setting. We still don't know the extent to which these kinds of smoothed analyses carry over into non-linear settings. Understanding the necessity of exploration in more complex settings seems like an important problem with real policy implications.