[edit]
Thompson Sampling with Less Exploration is Fast and Optimal
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:15239-15261, 2023.
Abstract
We propose $\epsilon$-Exploring Thompson Sampling ($\epsilon$-TS), a modified version of the Thompson Sampling (TS) algorithm for multi-armed bandits. In $\epsilon$-TS, arms are selected greedily based on empirical mean rewards with probability $1-\epsilon$, and based on posterior samples obtained from TS with probability $\epsilon$. Here, $\epsilon\in(0,1)$ is a user-defined constant. By reducing exploration, $\epsilon$-TS improves computational efficiency compared to TS while achieving better regret bounds. We establish that $\epsilon$-TS is both minimax optimal and asymptotically optimal for various popular reward distributions, including Gaussian, Bernoulli, Poisson, and Gamma. A key technical advancement in our analysis is the relaxation of the requirement for a stringent anti-concentration bound of the posterior distribution, which was necessary in recent analyses that achieved similar bounds. As a result, $\epsilon$-TS maintains the posterior update structure of TS while minimizing alterations, such as clipping the sampling distribution or solving the inverse of the Kullback-Leibler (KL) divergence between reward distributions, as done in previous work. Furthermore, our algorithm is as easy to implement as TS, but operates significantly faster due to reduced exploration. Empirical evaluations confirm the efficiency and optimality of $\epsilon$-TS.