Posterior sampling-based online learning for the stochastic shortest path model

Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, Haipeng Luo
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:922-931, 2023.

Abstract

We consider the problem of online reinforcement learning for the Stochastic Shortest Path (SSP) problem modeled as an unknown MDP with an absorbing state. We propose PSRL-SSP, a simple posterior sampling-based reinforcement learning algorithm for the SSP problem. The algorithm operates in epochs. At the beginning of each epoch, a sample is drawn from the posterior distribution over the unknown model dynamics, and the optimal policy with respect to the drawn sample is followed during that epoch. An epoch ends when either the number of visits to the goal state in the current epoch exceeds that of the previous epoch, or the number of visits to some state-action pair doubles. We establish a Bayesian regret bound of $\tilde{\mathcal{O}}(B_{\ast} S\sqrt{AK})$, where $B_{\ast}$ is an upper bound on the expected cost of the optimal policy, $S$ is the size of the state space, $A$ is the size of the action space, and $K$ is the number of episodes. The algorithm requires only knowledge of the prior distribution and has no hyper-parameters to tune. It is the first such posterior sampling algorithm and numerically outperforms previously proposed optimism-based algorithms.
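
For concreteness, below is a minimal Python sketch of the epoch structure described in the abstract. The tabular environment interface (env.reset, env.step, env.costs), the Dirichlet posteriors over transitions, and the solve_ssp planner are illustrative assumptions for this sketch, not interfaces specified by the paper.

import numpy as np

def psrl_ssp(env, prior_counts, num_episodes, goal_state, solve_ssp):
    """Sketch of the PSRL-SSP epoch loop (assumed interfaces, known costs).

    prior_counts: array of shape (S, A, S) with positive Dirichlet pseudo-counts.
    solve_ssp: hypothetical planner returning a deterministic optimal proper
    policy (array of length S) for a sampled transition model.
    """
    S, A = prior_counts.shape[:2]
    counts = prior_counts.copy()          # Dirichlet pseudo-counts over next states
    visits = np.zeros((S, A), dtype=int)  # cumulative visit counts N(s, a)
    episodes_done = 0
    prev_epoch_goal_visits = 0

    state = env.reset()
    while episodes_done < num_episodes:
        # New epoch: sample a transition model from the posterior and plan.
        sampled_P = np.array([[np.random.dirichlet(counts[s, a])
                               for a in range(A)] for s in range(S)])
        policy = solve_ssp(sampled_P, env.costs, goal_state)

        epoch_goal_visits = 0
        epoch_start_visits = visits.copy()
        while True:
            action = policy[state]
            next_state, cost = env.step(action)
            counts[state, action][next_state] += 1
            visits[state, action] += 1
            state = next_state

            if state == goal_state:           # episode ends at the goal state
                epoch_goal_visits += 1
                episodes_done += 1
                state = env.reset()
                if episodes_done >= num_episodes:
                    break

            # Epoch-switching rules from the abstract:
            # (1) goal visits in this epoch exceed those of the previous epoch, or
            # (2) the visit count of some (s, a) pair has doubled since the epoch began.
            if (epoch_goal_visits > prev_epoch_goal_visits or
                    np.any(visits >= 2 * np.maximum(epoch_start_visits, 1))):
                prev_epoch_goal_visits = epoch_goal_visits
                break

With Dirichlet priors, posterior sampling amounts to drawing one next-state distribution per (s, a) from the updated pseudo-counts, which is consistent with the abstract's claim that the algorithm needs only the prior and has no confidence-interval hyper-parameters to tune.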

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-jafarnia-jahromi23a,
  title     = {Posterior sampling-based online learning for the stochastic shortest path model},
  author    = {Jafarnia-Jahromi, Mehdi and Chen, Liyu and Jain, Rahul and Luo, Haipeng},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {922--931},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/jafarnia-jahromi23a/jafarnia-jahromi23a.pdf},
  url       = {https://proceedings.mlr.press/v216/jafarnia-jahromi23a.html}
}
APA
Jafarnia-Jahromi, M., Chen, L., Jain, R., & Luo, H. (2023). Posterior sampling-based online learning for the stochastic shortest path model. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:922-931. Available from https://proceedings.mlr.press/v216/jafarnia-jahromi23a.html.
