First-Order Bayesian Regret Analysis of Thompson Sampling

Sébastien Bubeck, Mark Sellke
Proceedings of the 31st International Conference on Algorithmic Learning Theory, PMLR 117:196-233, 2020.

Abstract

We address online combinatorial optimization when the player has a prior over the adversary’s sequence of losses. In this setting, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, allowing for elegant proofs of Bayesian regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $O(\sqrt{L^*})$ where $L^*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. We additionally introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state of the art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. We tighten these results by leveraging a recent insight of Zimmert and Lattimore connecting Thompson Sampling and online stochastic mirror descent, which allows us to replace the Shannon entropy with more general mirror maps.

Cite this Paper


BibTeX
@InProceedings{pmlr-v117-bubeck20a, title = {First-Order Bayesian Regret Analysis of Thompson Sampling}, author = {Bubeck, S\'ebastien and Sellke, Mark}, booktitle = {Proceedings of the 31st International Conference on Algorithmic Learning Theory}, pages = {196--233}, year = {2020}, editor = {Kontorovich, Aryeh and Neu, Gergely}, volume = {117}, series = {Proceedings of Machine Learning Research}, month = {08 Feb--11 Feb}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v117/bubeck20a/bubeck20a.pdf}, url = {https://proceedings.mlr.press/v117/bubeck20a.html}, abstract = {We address online combinatorial optimization when the player has a prior over the adversary’s sequence of losses. In this setting, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, allowing for elegant proofs of Bayesian regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $O(\sqrt{L^*})$ where $L^*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. We additionally introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state of the art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. We tighten these results by leveraging a recent insight of Zimmert and Lattimore connecting Thompson Sampling and online stochastic mirror descent, which allows us to replace the Shannon entropy with more general mirror maps.} }
Endnote
%0 Conference Paper %T First-Order Bayesian Regret Analysis of Thompson Sampling %A Sébastien Bubeck %A Mark Sellke %B Proceedings of the 31st International Conference on Algorithmic Learning Theory %C Proceedings of Machine Learning Research %D 2020 %E Aryeh Kontorovich %E Gergely Neu %F pmlr-v117-bubeck20a %I PMLR %P 196--233 %U https://proceedings.mlr.press/v117/bubeck20a.html %V 117 %X We address online combinatorial optimization when the player has a prior over the adversary’s sequence of losses. In this setting, Russo and Van Roy proposed an information-theoretic analysis of Thompson Sampling based on the information ratio, allowing for elegant proofs of Bayesian regret bounds. In this paper we introduce three novel ideas to this line of work. First we propose a new quantity, the scale-sensitive information ratio, which allows us to obtain more refined first-order regret bounds (i.e., bounds of the form $O(\sqrt{L^*})$ where $L^*$ is the loss of the best combinatorial action). Second we replace the entropy over combinatorial actions by a coordinate entropy, which allows us to obtain the first optimal worst-case bound for Thompson Sampling in the combinatorial setting. We additionally introduce a novel link between Bayesian agents and frequentist confidence intervals. Combining these ideas we show that the classical multi-armed bandit first-order regret bound $\tilde{O}(\sqrt{d L^*})$ still holds true in the more challenging and more general semi-bandit scenario. This latter result improves the previous state of the art bound $\tilde{O}(\sqrt{(d+m^3)L^*})$ by Lykouris, Sridharan and Tardos. We tighten these results by leveraging a recent insight of Zimmert and Lattimore connecting Thompson Sampling and online stochastic mirror descent, which allows us to replace the Shannon entropy with more general mirror maps.
APA
Bubeck, S. & Sellke, M.. (2020). First-Order Bayesian Regret Analysis of Thompson Sampling. Proceedings of the 31st International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 117:196-233 Available from https://proceedings.mlr.press/v117/bubeck20a.html.

Related Material