Make the Minority Great Again: FirstOrder Regret Bound for Contextual Bandits
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:186194, 2018.
Abstract
Regret bounds in online learning compare the player’s performance to $L*$, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon $T$. The more refined concept of firstorder regret bound replaces this with a scaling $\sqrt{L*}$, which may be much smaller than $\sqrt{T}$.It is well known that minor variants of standard algorithms satisfy firstorder regret bounds in the full information and multiarmed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain firstorder regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space.
Related Material


