Online Restless Bandits with Unobserved States

Bowen Jiang, Bo Jiang, Jian Li, Tao Lin, Xinbing Wang, Chenghu Zhou
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:15041-15066, 2023.

Abstract

We study the online restless bandit problem, where each arm evolves independently according to a Markov chain and the reward of pulling an arm depends on both the current state of the corresponding Markov chain and the pulled arm. The agent (decision maker) knows neither the transition functions nor the reward functions, and cannot observe the states of the arms even after pulling. The goal is to sequentially choose which arms to pull so as to maximize the expected cumulative reward. In this paper, we propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The algorithm proceeds in episodes of increasing length, and each episode is divided into an exploration phase and an exploitation phase. During the exploration phase, samples of action-reward pairs are collected in a round-robin fashion and used to update the posterior distribution, which is maintained as a mixture of Dirichlet distributions. At the beginning of the exploitation phase, TSEETC draws a sample from the posterior distribution and treats it as the true parameters. It then follows the optimal policy for the sampled model for the rest of the episode. We establish a Bayesian regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$ for TSEETC, where $T$ is the time horizon. We show through simulations that TSEETC outperforms existing algorithms in terms of regret.
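To make the episodic explore-then-commit structure concrete, below is a minimal Python sketch on a toy problem with a few hidden two-state arms and Bernoulli rewards. This is not the authors' implementation: the episode-length schedule, the per-state Beta/Dirichlet pseudo-count updates (a crude stand-in for the paper's mixture-of-Dirichlet posterior), the belief filter, and the myopic belief-based rule used in place of the optimal policy for the sampled model are all simplifying assumptions introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Toy environment (illustrative, not from the paper): N independent two-state arms,
# Bernoulli rewards whose mean depends on the pulled arm's hidden state. ----
N, S, T = 3, 2, 20000
true_P = np.array([rng.dirichlet(np.ones(S), size=S) for _ in range(N)])  # hidden transition matrices
true_mu = rng.uniform(0.1, 0.9, size=(N, S))                              # hidden mean rewards per (arm, state)
hidden = rng.integers(0, S, size=N)                                       # hidden states, never revealed

def pull(arm):
    """Emit a Bernoulli reward from the pulled arm's hidden state, then let every arm transition."""
    r = float(rng.random() < true_mu[arm, hidden[arm]])
    for i in range(N):
        hidden[i] = rng.choice(S, p=true_P[i, hidden[i]])
    return r

# ---- Learner: Dirichlet/Beta pseudo-counts as a crude stand-in for the paper's
# mixture-of-Dirichlet posterior over transition and reward parameters. ----
trans_counts = np.ones((N, S, S))    # Dirichlet pseudo-counts for each row of each transition matrix
rew_counts = np.ones((N, S, 2))      # (failures, successes) per arm and state

def sample_model():
    """Thompson step: draw one model from the approximate posterior."""
    P = np.array([[rng.dirichlet(trans_counts[i, s]) for s in range(S)] for i in range(N)])
    mu = np.array([[rng.beta(rew_counts[i, s, 1], rew_counts[i, s, 0]) for s in range(S)] for i in range(N)])
    return P, mu

def belief_update(b, P, mu, arm, r):
    """Bayes filter under model (P, mu): condition the pulled arm's belief on the reward, then propagate all arms."""
    like = mu[arm] if r > 0 else 1.0 - mu[arm]
    b[arm] = b[arm] * like / np.dot(b[arm], like)
    return np.array([b[i] @ P[i] for i in range(N)])

beliefs = np.full((N, S), 1.0 / S)   # belief over each arm's hidden state
total, t, k = 0.0, 0, 0
while t < T:
    k += 1
    explore_len, exploit_len = 2 * N, 50 * k     # assumed schedule: episodes grow with the episode index k
    P_hat, mu_hat = sample_model()               # model used for filtering during exploration

    # Exploration phase: round-robin pulls, soft pseudo-count updates weighted by the current belief.
    for j in range(min(explore_len, T - t)):
        arm = j % N
        prior = beliefs[arm].copy()
        r = pull(arm)
        rew_counts[arm, :, int(r)] += prior                  # credit the observation to states in proportion to belief
        beliefs = belief_update(beliefs, P_hat, mu_hat, arm, r)
        trans_counts[arm] += np.outer(prior, beliefs[arm])   # soft transition counts for the pulled arm
        total += r; t += 1

    # Exploitation phase: sample once from the posterior, then commit for the rest of the episode.
    P_hat, mu_hat = sample_model()
    for _ in range(min(exploit_len, T - t)):
        arm = int(np.argmax([beliefs[i] @ mu_hat[i] for i in range(N)]))  # myopic stand-in; the paper plans over belief states
        r = pull(arm)
        beliefs = belief_update(beliefs, P_hat, mu_hat, arm, r)
        total += r; t += 1

print(f"average reward over {T} rounds: {total / T:.3f}")
```

The point of the sketch is the control flow rather than the exact updates: a short round-robin exploration phase feeds the approximate posterior, one posterior sample is drawn per episode, and the agent then commits to the policy of that sampled model for an increasingly long exploitation phase.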

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-jiang23d,
  title     = {Online Restless Bandits with Unobserved States},
  author    = {Jiang, Bowen and Jiang, Bo and Li, Jian and Lin, Tao and Wang, Xinbing and Zhou, Chenghu},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {15041--15066},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/jiang23d/jiang23d.pdf},
  url       = {https://proceedings.mlr.press/v202/jiang23d.html},
  abstract  = {We study the online restless bandit problem, where each arm evolves according to a Markov chain independently, and the reward of pulling an arm depends on both the current state of the corresponding Markov chain and the pulled arm. The agent (decision maker) does not know the transition functions and reward functions, and cannot observe the states of arms even after pulling. The goal is to sequentially choose which arms to pull so as to maximize the expected cumulative rewards collected. In this paper, we propose TSEETC, a learning algorithm based on Thompson Sampling with Episodic Explore-Then-Commit. The algorithm proceeds in episodes of increasing length and each episode is divided into exploration and exploitation phases. During the exploration phase, samples of action-reward pairs are collected in a round-robin fashion and utilized to update the posterior distribution as a mixture of Dirichlet distributions. At the beginning of the exploitation phase, TSEETC generates a sample from the posterior distribution as true parameters. It then follows the optimal policy for the sampled model for the rest of the episode. We establish the Bayesian regret bound $\tilde {\mathcal{O}}(\sqrt{T})$ for TSEETC, where $T$ is the time horizon. We show through simulations that TSEETC outperforms existing algorithms in regret.}
}
APA
Jiang, B., Jiang, B., Li, J., Lin, T., Wang, X. & Zhou, C. (2023). Online Restless Bandits with Unobserved States. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:15041-15066. Available from https://proceedings.mlr.press/v202/jiang23d.html.
