## Tuesday, September 01, 2020

### No Regret Algorithms from the Min Max Theorem

The existence of no-regret learning algorithms can be used to prove Von-Neumann's min-max theorem. This argument is originally due to Freund and Schapire, and I teach it to my undergraduates in my algorithmic game theory class. The min-max theorem also can be used to prove the existence of no-regret learning algorithms. Here is a constructive version of the argument (Constructive in that in the resulting algorithm, you only need to solve polynomially sized zero-sum games, so you can do it via linear programming).

Recall the setting. Play proceeds in rounds $t \in \{1,\ldots,T\}$. At each day $t$, the learner chooses one of $k$ actions $i_t \in \{1,\ldots,k\}$, and the adversary chooses a loss vector $\ell^t \in [0,1]^k$. The learner incurs loss $\ell^t_{i_t}$, corresponding to the action he chose. At the end of the interaction, the regret of the learner is defined to be the difference between the cumulative loss he incurred and the cumulative loss of the best fixed action (played consistently)  in hindsight:
$$\textrm{Regret}_T = \max_j \left(\sum_{t=1}^T \ell^t_{i_t} - \ell^t_j\right)$$
A classical and remarkable result is that there exist algorithms that can guarantee that regret grows only sublinearly with time: $\textrm{Regret}_T = O(\sqrt{T})$. Lets prove this.

Define the non-negative portion of our cumulative regret with respect to action $j$ up until day $d$ as:
$$V_d^j = \left(\sum_{t=1}^d\left(\ell^t_{i_t} - \ell^t_j\right)\right)^+$$
and our additional regret at day $d+1$ with respect to action $j$ as:
$$r_{j}^{d+1} = \ell_{i_{d+1}}^{d+1} - \ell^{d+1}_j$$
Observe that if $V_{d}^j \geq 1$  then $V_{d+1}^j = V_d^j + r_j^{d+1}$.

Define a surrogate loss function as our squared cumulative regrets, summed over all actions:
$$L_d = \sum_{j=1}^k (V_d^j)^2$$
Observe that we can write the expected gain in our loss on day $d+1$, conditioned on the history thus far:
$$\mathbb{E}[L_{d+1} - L_d] \leq \sum_{j : V_d^j \geq 1} \mathbb{E}[(V_d^j+r_j^{d+1})^2 - (V_d^j)^2) ] + 3k$$
$$= \sum_{j : V_d^j \geq 1} \left(2V_d^j \mathbb{E}[r_{j}^{d+1}] + \mathbb{E}[(r_{j}^{d+1})^2]\right) + 3k$$
$$\leq \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) + 4k$$
where the expectations are taken over the randomness of both the learner and the adversary in round $d+1$.

Now consider a zero-sum game played between the learner and the adversary in which the learner is the minimization player, the adversary is the maximization player, and the utility function is $$u(i_{d+1}, \ell^{d+1}) = \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right)$$ The min-max theorem says that the learner can guarantee the same payoff for herself in the following two scenarios:

1. The learner first has to commit to playing a distribution $p_{d+1}$ over actions $i$, and then the adversary gets to best respond by picking the worst possible loss vectors, or
2. The adversary has to first commit to a distribution over loss vectors $\ell$ and then the learner gets the benefit of picking the best action $i_{d+1}$ to respond with.
Scenario 1) is the scenario our learner finds herself in, when playing against an adaptive adversary. But 2) is much easier to analyze. If the adversary first commits to a distribution over loss vectors $\ell^{d+1}$, the learner can always choose action $i_{d+1} = \arg\min_j \mathbb{E}[\ell^{d+1}_j]$, which guarantees that $\mathbb{E}[r_{j}^{d+1}] \leq 0$, which in turn guarantees that the value of the game $\sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) \leq 0$.  Hence, the min-max theorem tells us that the learner always has a distribution over actions $p_{d+1}$ that guarantees that $\mathbb{E}[L_{d+1} - L_d] \leq 4k$, even in the worst case over loss functions. If the learner always plays according to this distribution, then by a telescoping sum, we have that:
$$\mathbb{E}[L_T] \leq 4kT$$.
We therefore have by Jensen's inequality that:
$$\mathbb{E}[\max_j (V^j_T)] \leq \sqrt{\mathbb{E}[\max_j (V^j_T)^2]}\leq \sqrt{\mathbb{E}[\sum_{j=1}^k (V_j^T)^2]} \leq 2\sqrt{kT}$$.