Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:2173-2174, 2019.
Abstract
We study the linear contextual bandit problem with finite action sets. When the problem dimension is d, the time horizon is T, and there are n ≤ 2^{d/2} candidate actions per time period, we (1) show that the minimax expected regret is Ω(√(dT log T log n)) for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches this lower bound up to iterated logarithmic factors. Our algorithmic result saves two √(log T) factors over previous analyses, and our information-theoretic lower bound improves previous results by one √(log T) factor, revealing a regret scaling quite different from that of classical multi-armed bandits, in which no logarithmic-in-T term is present in the minimax regret. Our proof techniques include variable confidence levels and a careful analysis of the layer sizes of SupLinUCB on the upper-bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower-bound side.
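For orientation, the sketch below illustrates the core that the LinUCB/SupLinUCB family shares: a ridge-regression estimate of the unknown parameter plus an elliptical confidence bonus scaled by a confidence-level parameter, which is the quantity the paper's VCL idea varies across elimination layers. This is a simplified single-layer illustration under assumed interfaces (the names linucb, contexts, reward_fn, and alpha are all illustrative), not the paper's VCL SupLinUCB algorithm itself.

    import numpy as np

    def linucb(contexts, reward_fn, T, d, alpha=1.0, lam=1.0):
        """Single-layer LinUCB sketch (illustrative, not VCL SupLinUCB).

        contexts: callable t -> (n, d) array of candidate action features.
        reward_fn: callable (t, x) -> noisy reward for playing feature x.
        alpha: confidence-level multiplier; VCL SupLinUCB's key idea is to
               vary this level across its elimination layers.
        """
        V = lam * np.eye(d)   # regularized Gram matrix of played features
        b = np.zeros(d)       # sum of reward-weighted played features
        total = 0.0
        for t in range(T):
            X = contexts(t)                     # (n, d) candidate features
            V_inv = np.linalg.inv(V)
            theta_hat = V_inv @ b               # ridge-regression estimate
            # elliptical-norm bonus ||x||_{V^{-1}} for each candidate action
            bonus = np.sqrt(np.einsum("ij,jk,ik->i", X, V_inv, X))
            x = X[np.argmax(X @ theta_hat + alpha * bonus)]  # optimistic pick
            r = reward_fn(t, x)
            V += np.outer(x, x)                 # rank-one Gram-matrix update
            b += r * x
            total += r
        return total

    # Toy run: d = 5, with n = 10 random candidate actions per round.
    rng = np.random.default_rng(0)
    theta_star = rng.normal(size=5)
    theta_star /= np.linalg.norm(theta_star)
    print(linucb(lambda t: rng.normal(size=(10, 5)),
                 lambda t, x: x @ theta_star + 0.1 * rng.standard_normal(),
                 T=2000, d=5))

In this sketch the bonus term √(x^T V^{-1} x) is exactly the quantity controlled by the elliptical potential lemmas whose tightness the paper's lower-bound construction exhibits.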