Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning

Amin Karbasi, Nikki Lijing Kuang, Yian Ma, Siddharth Mitra
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:15828-15860, 2023.

Abstract

Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either due to the inherent nature of the problem or to serve the purpose of reducing communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.
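To make the sampling idea concrete, below is a minimal, self-contained sketch (an illustration under simplifying assumptions, not the authors' algorithm): each arm's posterior over its mean reward is approximated by a few unadjusted Langevin steps on a Gaussian log-posterior, and posterior samples are refreshed only at batch boundaries, so interaction with the posterior mimics reduced communication. The function names, the Gaussian prior/likelihood, and the fixed batch size are all assumptions; the paper's batch schedule and regret analysis are more refined than this.

import numpy as np

# Illustrative sketch only: approximate Thompson sampling for a Gaussian-reward
# bandit, where each arm's posterior sample is drawn via a few unadjusted
# Langevin steps on the log-posterior instead of an exact conjugate update.
# Priors, step sizes, and the fixed batch schedule are assumptions.

def langevin_posterior_sample(rewards, prior_var=1.0, noise_var=1.0,
                              n_steps=50, step_size=0.01, rng=None):
    """Approximate sample of an arm's mean reward from its posterior,
    using unadjusted Langevin dynamics on the Gaussian log-posterior."""
    rng = np.random.default_rng() if rng is None else rng
    theta = 0.0  # initialize at the prior mean; warm-starting is also common
    n = len(rewards)
    s = float(np.sum(rewards))
    for _ in range(n_steps):
        # gradient of the log-posterior: N(0, prior_var) prior and
        # Gaussian likelihood with variance noise_var
        grad = -theta / prior_var + (s - n * theta) / noise_var
        theta += step_size * grad + np.sqrt(2 * step_size) * rng.normal()
    return theta

def batched_langevin_ts(arm_means, horizon=2000, batch_size=100, rng=None):
    """Run Thompson sampling in batches: posterior samples are recomputed
    only once per batch, mimicking reduced communication."""
    rng = np.random.default_rng(0) if rng is None else rng
    K = len(arm_means)
    history = [[] for _ in range(K)]
    total_reward = 0.0
    for start in range(0, horizon, batch_size):
        # one Langevin posterior sample per arm at the start of the batch
        samples = [langevin_posterior_sample(history[a], rng=rng) for a in range(K)]
        arm = int(np.argmax(samples))
        for _ in range(min(batch_size, horizon - start)):
            r = rng.normal(arm_means[arm], 1.0)
            history[arm].append(r)
            total_reward += r
    return total_reward

if __name__ == "__main__":
    print(batched_langevin_ts([0.1, 0.5, 0.9]))

The Langevin step stands in for an exact conjugate posterior update: it only requires gradients of the log-posterior, which is what lets this style of Thompson sampling extend beyond conjugate reward families.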

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-karbasi23a,
  title     = {{L}angevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning},
  author    = {Karbasi, Amin and Kuang, Nikki Lijing and Ma, Yian and Mitra, Siddharth},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {15828--15860},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/karbasi23a/karbasi23a.pdf},
  url       = {https://proceedings.mlr.press/v202/karbasi23a.html},
  abstract  = {Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either due to the inherent nature of the problem or to serve the purpose of reducing communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.}
}
Endnote
%0 Conference Paper
%T Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning
%A Amin Karbasi
%A Nikki Lijing Kuang
%A Yian Ma
%A Siddharth Mitra
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-karbasi23a
%I PMLR
%P 15828--15860
%U https://proceedings.mlr.press/v202/karbasi23a.html
%V 202
%X Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making problems are often carried out in a batched manner, either due to the inherent nature of the problem or to serve the purpose of reducing communication and computation costs. In this work, we jointly study these problems in two popular settings, namely, stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched Langevin Thompson Sampling algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and maintain the same order-optimal regret guarantees of $\mathcal{O}(\log T)$ for stochastic MABs, and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.
APA
Karbasi, A., Kuang, N.L., Ma, Y. & Mitra, S. (2023). Langevin Thompson Sampling with Logarithmic Communication: Bandits and Reinforcement Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:15828-15860. Available from https://proceedings.mlr.press/v202/karbasi23a.html.
