Recall the setting. Play proceeds in rounds t∈{1,…,T}. At each day t, the learner chooses one of k actions it∈{1,…,k}, and the adversary chooses a loss vector ℓt∈[0,1]k. The learner incurs loss ℓtit, corresponding to the action he chose. At the end of the interaction, the regret of the learner is defined to be the difference between the cumulative loss he incurred and the cumulative loss of the best fixed action (played consistently) in hindsight:
RegretT=maxj(T∑t=1ℓtit−ℓtj)
A classical and remarkable result is that there exist algorithms that can guarantee that regret grows only sublinearly with time: RegretT=O(√T). Lets prove this.
Define the non-negative portion of our cumulative regret with respect to action j up until day d as:
Vjd=(d∑t=1(ℓtit−ℓtj))+
and our additional regret at day d+1 with respect to action j as:
rd+1j=ℓd+1id+1−ℓd+1j
Observe that if Vjd≥1 then Vjd+1=Vjd+rd+1j.
Define a surrogate loss function as our squared cumulative regrets, summed over all actions:
Ld=k∑j=1(Vjd)2
Observe that we can write the expected gain in our loss on day d+1, conditioned on the history thus far:
E[Ld+1−Ld]≤∑j:Vjd≥1E[(Vjd+rd+1j)2−(Vjd)2)]+3k
=∑j:Vjd≥1(2VjdE[rd+1j]+E[(rd+1j)2])+3k
≤k∑j=1(2VjdE[rd+1j])+4k
where the expectations are taken over the randomness of both the learner and the adversary in round d+1.
Now consider a zero-sum game played between the learner and the adversary in which the learner is the minimization player, the adversary is the maximization player, and the utility function is u(id+1,ℓd+1)=k∑j=1(2VjdE[rd+1j]) The min-max theorem says that the learner can guarantee the same payoff for herself in the following two scenarios:
- The learner first has to commit to playing a distribution pd+1 over actions i, and then the adversary gets to best respond by picking the worst possible loss vectors, or
- The adversary has to first commit to a distribution over loss vectors ℓ and then the learner gets the benefit of picking the best action id+1 to respond with.
E[LT]≤4kT.
We therefore have by Jensen's inequality that:
E[maxj(VjT)]≤√E[maxj(VjT)2]≤√E[k∑j=1(VTj)2]≤2√kT.
No comments:
Post a Comment