Regret guarantees for model-based reinforcement learning with long-term average constraints

Mridul Agarwal, Qinbo Bai, Vaneet Aggarwal
Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:22-31, 2022.

Abstract

We consider the problem of a constrained Markov Decision Process (CMDP) in which an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs below given thresholds. In this paper, we propose a posterior sampling based algorithm with which the agent can learn optimal policies for interacting with the CMDP. We show that, under a slackness assumption characterized by $\kappa$, the optimization problem is feasible for the sampled MDPs. Further, for an MDP with $S$ states, $A$ actions, and mixing time $T_M$, we prove that by following the proposed algorithm the agent can bound the regret of not accumulating rewards from an optimal policy by $\tilde{O}(T_M S\sqrt{AT})$. We also show that the violation of each of the $K$ constraints is bounded by $\tilde{O}(T_M S\sqrt{AT})$. To the best of our knowledge, this is the first work that obtains $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method.
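To make the setting concrete, the constrained problem and the two quantities that the bounds refer to can be sketched as follows; the thresholds $C_k$, the cost functions $c_k$, and the optimal constrained average reward $\lambda^*$ are illustrative notation inferred from the abstract rather than the paper's own symbols:

\begin{align*}
\max_{\pi} \quad & \liminf_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \\
\text{subject to} \quad & \limsup_{T \to \infty} \frac{1}{T}\,\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} c_k(s_t, a_t)\right] \le C_k, \qquad k = 1, \dots, K,
\end{align*}

with the $T$-step reward regret and the violation of constraint $k$ measured as

\begin{align*}
R(T) = T\,\lambda^{*} - \sum_{t=1}^{T} r(s_t, a_t), \qquad
V_k(T) = \left(\sum_{t=1}^{T} c_k(s_t, a_t) - T\,C_k\right)_{+},
\end{align*}

both of which are bounded by $\tilde{O}(T_M S\sqrt{AT})$ under the slackness assumption.

The posterior sampling idea can also be illustrated with a short, self-contained Python sketch (not the authors' code): maintain a Dirichlet posterior over transition probabilities, periodically sample an MDP from the posterior, solve the constrained average-reward problem for the sample via the occupancy-measure linear program, and act with the resulting policy while updating counts. The environment, thresholds, and resampling schedule below are made up for illustration.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, K = 4, 2, 1                                  # states, actions, number of cost constraints
P_true = rng.dirichlet(np.ones(S), size=(S, A))    # true transition kernel, shape (S, A, S)
reward = rng.uniform(size=(S, A))                  # reward r(s, a) in [0, 1]
cost = rng.uniform(size=(K, S, A))                 # costs c_k(s, a) in [0, 1]
threshold = np.full(K, 0.6)                        # average-cost thresholds C_k (illustrative)

def solve_cmdp(P, r, c, C):
    """Occupancy-measure LP for a known model: maximize sum_{s,a} rho(s,a) r(s,a)
    subject to flow conservation, normalization, and sum_{s,a} rho(s,a) c_k(s,a) <= C_k."""
    n = S * A
    A_eq = np.zeros((S + 1, n))
    for sp in range(S):                            # flow conservation for each next state s'
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] -= P[s, a, sp]
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
    A_eq[S, :] = 1.0                               # normalization: occupancy measure sums to 1
    b_eq = np.zeros(S + 1)
    b_eq[S] = 1.0
    res = linprog(-r.reshape(n), A_ub=c.reshape(K, n), b_ub=C,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    if not res.success:                            # sampled model infeasible: fall back to uniform
        return np.full((S, A), 1.0 / A)
    rho = np.clip(res.x.reshape(S, A), 0.0, None)
    pi = rho + 1e-12                               # avoid all-zero rows before normalizing
    return pi / pi.sum(axis=1, keepdims=True)      # pi(a | s)

counts = np.ones((S, A, S))                        # Dirichlet(1, ..., 1) prior on next-state dists
state, total_reward, total_cost, T = 0, 0.0, np.zeros(K), 2000
for t in range(T):
    if t % 100 == 0:                               # resample the model every 100 steps (illustrative)
        P_sample = np.array([rng.dirichlet(counts[s, a])
                             for s in range(S) for a in range(A)]).reshape(S, A, S)
        pi = solve_cmdp(P_sample, reward, cost, threshold)
    action = rng.choice(A, p=pi[state])
    next_state = rng.choice(S, p=P_true[state, action])
    total_reward += reward[state, action]
    total_cost += cost[:, state, action]
    counts[state, action, next_state] += 1         # posterior update
    state = next_state

print("empirical average reward:", total_reward / T)
print("empirical average costs: ", total_cost / T, "thresholds:", threshold)

Planning over occupancy measures is one standard way to handle average-cost constraints for a known (here, sampled) model; the paper's exact planning step and episode schedule may differ.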

Cite this Paper


BibTeX
@InProceedings{pmlr-v180-agarwal22b,
  title     = {Regret guarantees for model-based reinforcement learning with long-term average constraints},
  author    = {Agarwal, Mridul and Bai, Qinbo and Aggarwal, Vaneet},
  booktitle = {Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence},
  pages     = {22--31},
  year      = {2022},
  editor    = {Cussens, James and Zhang, Kun},
  volume    = {180},
  series    = {Proceedings of Machine Learning Research},
  month     = {01--05 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v180/agarwal22b/agarwal22b.pdf},
  url       = {https://proceedings.mlr.press/v180/agarwal22b.html},
  abstract  = {We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose \NAM, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. We show that with the assumption of slackness, characterized by $\kappa$, the optimization problem is feasible for the sampled MDPs. Further, for MDP with $S$ states, $A$ actions, and mixing time $T_M$, we prove that following \NAM{} algorithm, the agent can bound the regret of not accumulating rewards from an optimal policy by $\Tilde{O}(T_MS\sqrt{AT})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(T_MS\sqrt{AT})$. To the best of our knowledge, this is the first work that obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method.}
}
Endnote
%0 Conference Paper
%T Regret guarantees for model-based reinforcement learning with long-term average constraints
%A Mridul Agarwal
%A Qinbo Bai
%A Vaneet Aggarwal
%B Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2022
%E James Cussens
%E Kun Zhang
%F pmlr-v180-agarwal22b
%I PMLR
%P 22--31
%U https://proceedings.mlr.press/v180/agarwal22b.html
%V 180
%X We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose \NAM, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. We show that with the assumption of slackness, characterized by $\kappa$, the optimization problem is feasible for the sampled MDPs. Further, for MDP with $S$ states, $A$ actions, and mixing time $T_M$, we prove that following \NAM{} algorithm, the agent can bound the regret of not accumulating rewards from an optimal policy by $\Tilde{O}(T_MS\sqrt{AT})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(T_MS\sqrt{AT})$. To the best of our knowledge, this is the first work that obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method.
APA
Agarwal, M., Bai, Q. & Aggarwal, V. (2022). Regret guarantees for model-based reinforcement learning with long-term average constraints. Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 180:22-31. Available from https://proceedings.mlr.press/v180/agarwal22b.html.
