[edit]
Thompson Sampling for Bandit Convex Optimisation
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:231-263, 2025.
Abstract
We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improveson the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.