Thompson Sampling for Bandit Convex Optimisation

Alireza Bakhtiari, Tor Lattimore, Csaba Szepesvári
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:231-263, 2025.

Abstract

We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.
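As background for the abstract, the core Thompson sampling principle is to maintain a posterior over the unknown environment, sample from it, and act greedily with respect to the sample. The sketch below shows this in the simplest finite-armed Bernoulli setting, which is not the paper's setting (the paper studies continuous actions with convex losses); the Beta(1, 1) priors and arm means are illustrative assumptions.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    Illustrative sketch only: the paper analyses Thompson sampling for
    bandit convex optimisation over continuous action sets, not this
    finite-armed special case.
    """
    rng = random.Random(seed)
    k = len(true_means)
    alphas = [1] * k  # posterior Beta parameters: successes + 1
    betas = [1] * k   # posterior Beta parameters: failures + 1
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its posterior,
        # then play the arm whose sample is largest.
        samples = [rng.betavariate(alphas[i], betas[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        total_reward += reward
        # Bayesian update for the played arm's Beta posterior.
        alphas[arm] += reward
        betas[arm] += 1 - reward
    return total_reward

if __name__ == "__main__":
    print(thompson_sampling([0.2, 0.5, 0.8], horizon=2000))
```

Over time the posterior concentrates on the best arm, so the per-round regret shrinks; the paper's contribution is to quantify this in the much harder convex-loss setting.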

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-bakhtiari25a,
  title     = {Thompson Sampling for Bandit Convex Optimisation},
  author    = {Bakhtiari, Alireza and Lattimore, Tor and Szepesv\'{a}ri, Csaba},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {231--263},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/bakhtiari25a/bakhtiari25a.pdf},
  url       = {https://proceedings.mlr.press/v291/bakhtiari25a.html},
  abstract  = {We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.}
}
Endnote
%0 Conference Paper
%T Thompson Sampling for Bandit Convex Optimisation
%A Alireza Bakhtiari
%A Tor Lattimore
%A Csaba Szepesvári
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-bakhtiari25a
%I PMLR
%P 231--263
%U https://proceedings.mlr.press/v291/bakhtiari25a.html
%V 291
%X We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation, where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improves on the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.
APA
Bakhtiari, A., Lattimore, T. &amp; Szepesvári, C. (2025). Thompson Sampling for Bandit Convex Optimisation. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:231-263. Available from https://proceedings.mlr.press/v291/bakhtiari25a.html.