Recall the setting. Play proceeds in rounds $t \in \{1,\ldots,T\}$. At each day $t$, the learner chooses one of $k$ actions $i_t \in \{1,\ldots,k\}$, and the adversary chooses a loss vector $\ell^t \in [0,1]^k$. The learner incurs loss $\ell^t_{i_t}$, corresponding to the action he chose. At the end of the interaction, the regret of the learner is defined to be the difference between the cumulative loss he incurred and the cumulative loss of the best fixed action (played consistently) in hindsight:

$$\textrm{Regret}_T = \max_j \left(\sum_{t=1}^T \ell^t_{i_t} - \ell^t_j\right)$$

A classical and remarkable result is that there exist algorithms that can guarantee that regret grows only sublinearly with time: $\textrm{Regret}_T = O(\sqrt{T})$. Lets prove this.

Define the non-negative portion of our cumulative regret with respect to action $j$ up until day $d$ as:

$$V_d^j = \left(\sum_{t=1}^d\left(\ell^t_{i_t} - \ell^t_j\right)\right)^+$$

and our additional regret at day $d+1$ with respect to action $j$ as:

$$r_{j}^{d+1} = \ell_{i_{d+1}}^{d+1} - \ell^{d+1}_j$$

Observe that if $V_{d}^j \geq 1$ then $V_{d+1}^j = V_d^j + r_j^{d+1}$.

Define a surrogate loss function as our squared cumulative regrets, summed over all actions:

$$L_d = \sum_{j=1}^k (V_d^j)^2$$

Observe that we can write the expected gain in our loss on day $d+1$, conditioned on the history thus far:

$$\mathbb{E}[L_{d+1} - L_d] \leq \sum_{j : V_d^j \geq 1} \mathbb{E}[(V_d^j+r_j^{d+1})^2 - (V_d^j)^2) ] + 3k$$

$$= \sum_{j : V_d^j \geq 1} \left(2V_d^j \mathbb{E}[r_{j}^{d+1}] + \mathbb{E}[(r_{j}^{d+1})^2]\right) + 3k $$

$$\leq \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right) + 4k$$

where the expectations are taken over the randomness of both the learner and the adversary in round $d+1$.

Now consider a zero-sum game played between the learner and the adversary in which the learner is the minimization player, the adversary is the maximization player, and the utility function is $$u(i_{d+1}, \ell^{d+1}) = \sum_{j=1}^k \left(2V_d^j \mathbb{E}[r_{j}^{d+1}]\right)$$ The min-max theorem says that the learner can guarantee the same payoff for herself in the following two scenarios:

- The learner first has to commit to playing a distribution $p_{d+1}$ over actions $i$, and then the adversary gets to best respond by picking the worst possible loss vectors, or
- The adversary has to first commit to a distribution over loss vectors $\ell$ and then the learner gets the benefit of picking the best action $i_{d+1}$ to respond with.

*even in the worst case over loss functions*. If the learner always plays according to this distribution, then by a telescoping sum, we have that:

$$\mathbb{E}[L_T] \leq 4kT$$.

We therefore have by Jensen's inequality that:

$$\mathbb{E}[\max_j (V^j_T)] \leq \sqrt{\mathbb{E}[\max_j (V^j_T)^2]}\leq \sqrt{\mathbb{E}[\sum_{j=1}^k (V_j^T)^2]} \leq 2\sqrt{kT}$$.