Tight Regret and Complexity Bounds for Thompson Sampling via Langevin Monte Carlo

Tom Huix, Matthew Zhang, Alain Durmus
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:8749-8770, 2023.

Abstract

In this paper, we consider high dimensional contextual bandit problems. Within this setting, Thompson Sampling and its variants have been proposed and have been successfully applied to multiple machine learning problems. Existing theory on Thompson Sampling shows that it has suboptimal dimension dependency in contrast to upper confidence bound (UCB) algorithms. To circumvent this issue, Zhang (2021) recently proposed to modify Thompson Sampling by enforcing more exploration, and was hence able to attain optimal regret bounds. Nonetheless, this analysis does not permit tractable implementation in high dimensions. The main challenge therein is the simulation of the posterior samples at each step given the available observations. To overcome this, we propose and analyze the use of Markov Chain Monte Carlo methods. As a corollary, we show that for contextual linear bandits, using Langevin Monte Carlo (LMC) or the Metropolis Adjusted Langevin Algorithm (MALA), our algorithm attains optimal regret bounds of $\tilde{O}(d\sqrt{T})$. Furthermore, we show that this is obtained with $\tilde{O}(dT^4)$ and $\tilde{O}(dT^2)$ data evaluations for LMC and MALA, respectively. Finally, we validate our findings through numerical simulations and show that we outperform vanilla Thompson Sampling in high dimensions.
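The posterior-sampling step described above can be sketched in a few lines. The following is a minimal illustration, not the paper's exact algorithm: it assumes a linear-Gaussian bandit, where the negative log posterior reduces to a ridge-regression objective, and uses unadjusted Langevin Monte Carlo to draw an approximate posterior sample before playing the greedy arm. The function names, step size, and iteration count are our own choices for the sketch.

```python
import numpy as np

def lmc_posterior_sample(X, y, lam=1.0, step=1e-3, n_steps=200, rng=None):
    """Approximate posterior sample of theta for a linear-Gaussian bandit.

    Runs unadjusted Langevin Monte Carlo on the negative log posterior
    U(theta) = ||X @ theta - y||^2 / 2 + lam * ||theta||^2 / 2,
    i.e. theta <- theta - step * grad U(theta) + sqrt(2 * step) * N(0, I).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y) + lam * theta           # gradient of U
        noise = np.sqrt(2.0 * step) * rng.standard_normal(d)
        theta = theta - step * grad + noise                  # Langevin update
    return theta

def thompson_step(arms, X, y, rng=None):
    """One Thompson Sampling round: sample theta via LMC, play the best arm."""
    theta = lmc_posterior_sample(X, y, rng=rng)
    return int(np.argmax(arms @ theta))
```

In this sketch each round costs `n_steps` gradient evaluations over the full data, which is the source of the data-evaluation complexity the paper quantifies; replacing the unadjusted update with a Metropolis accept/reject step would give the MALA variant.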

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-huix23a,
  title     = {Tight Regret and Complexity Bounds for Thompson Sampling via Langevin Monte Carlo},
  author    = {Huix, Tom and Zhang, Matthew and Durmus, Alain},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {8749--8770},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/huix23a/huix23a.pdf},
  url       = {https://proceedings.mlr.press/v206/huix23a.html},
  abstract  = {In this paper, we consider high dimensional contextual bandit problems. Within this setting, Thompson Sampling and its variants have been proposed and have been successfully applied to multiple machine learning problems. Existing theory on Thompson Sampling shows that it has suboptimal dimension dependency in contrast to upper confidence bound (UCB) algorithms. To circumvent this issue and obtain optimal regret bounds, (Zhang, 2021) recently proposed to modify Thompson Sampling by enforcing more exploration and hence is able to attain optimal regret bounds. Nonetheless, this analysis does not permit tractable implementation in high dimensions. The main challenge therein is the simulation of the posterior samples at each step given the available observations. To overcome this, we propose and analyze the use of Markov Chain Monte Carlo methods. As a corollary, we show that for contextual linear bandits, using Langevin Monte Carlo (LMC) or Metropolis Adjusted Langevin Algorithm (MALA), our algorithm attains optimal regret bounds of $\tilde{O}(d\sqrt{T})$. Furthermore, we show that this is obtained with $\tilde{O}(dT^4)$, $\tilde{O}(dT^2)$ data evaluations respectively for LMC and MALA. Finally, we validate our findings through numerical simulations and show that we outperform vanilla Thompson sampling in high dimensions.}
}
Endnote
%0 Conference Paper
%T Tight Regret and Complexity Bounds for Thompson Sampling via Langevin Monte Carlo
%A Tom Huix
%A Matthew Zhang
%A Alain Durmus
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-huix23a
%I PMLR
%P 8749--8770
%U https://proceedings.mlr.press/v206/huix23a.html
%V 206
%X In this paper, we consider high dimensional contextual bandit problems. Within this setting, Thompson Sampling and its variants have been proposed and have been successfully applied to multiple machine learning problems. Existing theory on Thompson Sampling shows that it has suboptimal dimension dependency in contrast to upper confidence bound (UCB) algorithms. To circumvent this issue and obtain optimal regret bounds, (Zhang, 2021) recently proposed to modify Thompson Sampling by enforcing more exploration and hence is able to attain optimal regret bounds. Nonetheless, this analysis does not permit tractable implementation in high dimensions. The main challenge therein is the simulation of the posterior samples at each step given the available observations. To overcome this, we propose and analyze the use of Markov Chain Monte Carlo methods. As a corollary, we show that for contextual linear bandits, using Langevin Monte Carlo (LMC) or Metropolis Adjusted Langevin Algorithm (MALA), our algorithm attains optimal regret bounds of $\tilde{O}(d\sqrt{T})$. Furthermore, we show that this is obtained with $\tilde{O}(dT^4)$, $\tilde{O}(dT^2)$ data evaluations respectively for LMC and MALA. Finally, we validate our findings through numerical simulations and show that we outperform vanilla Thompson sampling in high dimensions.
APA
Huix, T., Zhang, M. & Durmus, A. (2023). Tight Regret and Complexity Bounds for Thompson Sampling via Langevin Monte Carlo. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:8749-8770. Available from https://proceedings.mlr.press/v206/huix23a.html.