Further Optimal Regret Bounds for Thompson Sampling

[edit]

Shipra Agrawal, Navin Goyal ;
Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, PMLR 31:99-107, 2013.

Abstract

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have comparable or better empirical performance compared to the state of the art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that proves the first near-optimal problem-independent bound of O(\sqrtNT\ln T) on the expected regret of this algorithm. Our novel martingale-based analysis techniques are conceptually simple, and easily extend to distributions other than the Beta distribution. For the version of Thompson Sampling that uses Gaussian priors, we prove a problem-independent bound of O(\sqrtNT\ln N) on the expected regret, and demonstrate the optimality of this bound by providing a matching lower bound. This lower bound of Ω(\sqrtNT\ln N) is the first lower bound on the performance of a natural version of Thompson Sampling that is away from the general lower bound of O(\sqrtNT) for the multi-armed bandit problem. Our near-optimal problem-independent bounds for Thompson Sampling solve a COLT 2012 open problem of Chapelle and Li. Additionally, our techniques simultaneously provide the optimal problem-dependent bound of (1+ε)\sum_i \frac\ln Td(\mu_i, \mu_1)+O(\fracNε^2) on the expected regret. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [2012].

Related Material