Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning

Bingshan Hu; Tianyue H. Zhang; Nidhi Hegde; Mark Schmidt

Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning

Bingshan Hu, Tianyue H. Zhang, Nidhi Hegde, Mark Schmidt

Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:890-899, 2023.

Abstract

We propose two Thompson Sampling-like, model-based learning algorithms for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea for the original O-TS is to clip the posterior distribution in an optimistic way to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a

$\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where

$S$ is the size of the state space,

$A$ is the size of the action space,

$H$ is the number of time-steps per episode and

$T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP

$^+$ ), achieves the (near)-optimal

$\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy. Since O-TS was only empirically studied previously, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose, O-TS-Bandit

$^+$ , a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit

$^+$ achieve the optimal

$O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where

$\Delta$ denotes the sub-optimality gap.

Cite this Paper

BibTeX


@InProceedings{pmlr-v216-hu23a,
  title = 	 {Optimistic {T}hompson Sampling-based algorithms for episodic reinforcement learning},
  author =       {Hu, Bingshan and Zhang, Tianyue H. and Hegde, Nidhi and Schmidt, Mark},
  booktitle = 	 {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages = 	 {890--899},
  year = 	 {2023},
  editor = 	 {Evans, Robin J. and Shpitser, Ilya},
  volume = 	 {216},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {31 Jul--04 Aug},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v216/hu23a/hu23a.pdf},
  url = 	 {https://proceedings.mlr.press/v216/hu23a.html},
  abstract = 	 {We propose two  Thompson Sampling-like, model-based learning algorithms  for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea for the original O-TS is to clip the posterior distribution in an optimistic way  to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first  algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a $\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the number of time-steps per episode and $T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP$^+$),  achieves the (near)-optimal $\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy.  Since O-TS was only empirically studied previously, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose,  O-TS-Bandit$^+$, a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit$^+$ achieve the optimal $O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where $\Delta$ denotes the sub-optimality gap.}
}

Endnote

%0 Conference Paper
%T Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning
%A Bingshan Hu
%A Tianyue H. Zhang
%A Nidhi Hegde
%A Mark Schmidt
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser	
%F pmlr-v216-hu23a
%I PMLR
%P 890--899
%U https://proceedings.mlr.press/v216/hu23a.html
%V 216
%X We propose two  Thompson Sampling-like, model-based learning algorithms  for episodic Markov decision processes (MDPs) with a finite time horizon. Our proposed algorithms are inspired by Optimistic Thompson Sampling (O-TS), empirically studied in Chapelle and Li [2011], May et al. [2012] for stochastic multi-armed bandits. The key idea for the original O-TS is to clip the posterior distribution in an optimistic way  to ensure that the sampled models are always better than the empirical models. Both of our proposed algorithms are easy to implement and only need one posterior sample to construct an episode-dependent model. Our first  algorithm, Optimistic Thompson Sampling for MDPs (O-TS-MDP), achieves a $\widetilde{O} \left(\sqrt{AS^2H^4T} \right)$ regret bound, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the number of time-steps per episode and $T$ is the number of episodes. Our second algorithm, Optimistic Thompson Sampling plus for MDPs (O-TS-MDP$^+$),  achieves the (near)-optimal $\widetilde{O} \left(\sqrt{ASH^3T} \right)$ regret bound by taking a more aggressive clipping strategy.  Since O-TS was only empirically studied previously, we derive regret bounds of O-TS for stochastic bandits. In addition, we propose,  O-TS-Bandit$^+$, a randomized version of UCB1 [Auer et al., 2002], for stochastic bandits. Both O-TS and O-TS-Bandit$^+$ achieve the optimal $O\left(\frac{A\ln(T)}{\Delta} \right)$ problem-dependent regret bound, where $\Delta$ denotes the sub-optimality gap.

APA


Hu, B., Zhang, T.H., Hegde, N. & Schmidt, M.. (2023). Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:890-899 Available from https://proceedings.mlr.press/v216/hu23a.html.

Optimistic Thompson Sampling-based algorithms for episodic reinforcement learning

Abstract

Cite this Paper

Related Material