A Safe Bayesian Learning Algorithm for Constrained MDPs with Bounded Constraint Violation

Krishna C Kalagarla, Rahul Jain, Pierluigi Nuzzo
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:2782-2790, 2025.

Abstract

Constrained Markov decision process (CMDP) models are increasingly important in many applications with multiple objectives. When the model is unknown and must be learned online, it is desirable to ensure that the constraint is met, or at least that the violation remains bounded over time. Recent literature has made progress on this very challenging problem, but existing approaches either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe-PSRL (posterior sampling-based RL) algorithm, which does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm efficiently trades off exploration and exploitation using posterior sampling-based exploration, yet provably suffers only bounded constraint violation thanks to carefully crafted pessimism. We establish a sub-linear $\tilde{O}(H^{2.5}\sqrt{|S|^2|A|K})$ upper bound on the Bayesian objective regret, along with a bounded, i.e., $\tilde{O}(1)$, constraint-violation regret over $K$ episodes for an $|S|$-state, $|A|$-action, $H$-horizon CMDP, which improves over state-of-the-art algorithms for the same setting.
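
For intuition, below is a minimal, self-contained sketch of the posterior-sampling-plus-pessimism recipe the abstract describes: maintain a Dirichlet posterior over the unknown transition kernel, sample a model each episode, and plan against a pessimistically tightened cost budget. The synthetic CMDP instance, the occupancy-measure LP planner, and the $H/\sqrt{k}$ tightening schedule are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Small synthetic episodic CMDP (sizes chosen arbitrarily for illustration).
S, A, H, K = 4, 2, 5, 200
tau = 0.5 * H                                     # budget: expected cumulative cost <= tau
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # true transitions, unknown to the learner
r = rng.uniform(size=(S, A))                      # known reward in [0, 1]
c = rng.uniform(size=(S, A))                      # known constraint cost in [0, 1]
mu0 = np.full(S, 1.0 / S)                         # initial state distribution

def solve_cmdp(P, budget):
    """Solve the finite-horizon CMDP as an occupancy-measure LP:
    variables z[h, s, a] >= 0, maximize expected reward subject to
    flow conservation and the expected-cost constraint."""
    n = H * S * A
    idx = lambda h, s, a: (h * S + s) * A + a
    A_eq, b_eq = [], []
    for s in range(S):                            # occupancy at h = 0 matches mu0
        row = np.zeros(n)
        for a in range(A):
            row[idx(0, s, a)] = 1.0
        A_eq.append(row); b_eq.append(mu0[s])
    for h in range(1, H):                         # flow conservation for h >= 1
        for s in range(S):
            row = np.zeros(n)
            for a in range(A):
                row[idx(h, s, a)] = 1.0
            for sp in range(S):
                for a in range(A):
                    row[idx(h - 1, sp, a)] -= P[sp, a, s]
            A_eq.append(row); b_eq.append(0.0)
    cost_row = np.tile(c.ravel(), H)              # expected cumulative cost <= budget
    res = linprog(-np.tile(r.ravel(), H),         # linprog minimizes, so negate reward
                  A_ub=cost_row[None, :], b_ub=[budget],
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n)
    if not res.success:
        return None
    z = res.x.reshape(H, S, A)
    # Extract a (possibly stochastic) policy; uniform where occupancy is ~0.
    row_sums = z.sum(axis=2, keepdims=True)
    return np.where(row_sums > 1e-9, z / np.maximum(row_sums, 1e-12), 1.0 / A)

counts = np.ones((S, A, S))                       # Dirichlet(1) prior over transitions
violation = 0.0
for k in range(1, K + 1):
    # Posterior sampling: draw one model from the current posterior.
    P_k = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)] for s in range(S)])
    # Pessimism: tighten the cost budget (illustrative H/sqrt(k) schedule,
    # not the schedule from the paper).
    eps_k = min(H / np.sqrt(k), tau)
    pi = solve_cmdp(P_k, tau - eps_k)
    if pi is None:                                # tightened LP infeasible; relax
        pi = solve_cmdp(P_k, tau)
    if pi is None:                                # last resort: uniform policy
        pi = np.full((H, S, A), 1.0 / A)
    # Roll out one episode in the true CMDP and update transition counts.
    s, ep_cost = rng.choice(S, p=mu0), 0.0
    for h in range(H):
        p = pi[h, s] / pi[h, s].sum()
        a = rng.choice(A, p=p)
        ep_cost += c[s, a]
        sp = rng.choice(S, p=P_true[s, a])
        counts[s, a, sp] += 1
        s = sp
    violation += max(ep_cost - tau, 0.0)          # realized (not expected) excess cost

print(f"cumulative realized constraint violation after {K} episodes: {violation:.2f}")
```

As the posterior concentrates, the sampled models approach the true kernel and the tightening vanishes, so the pessimistic policies converge toward the optimal constrained policy while early episodes stay on the safe side of the budget.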

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-kalagarla25a,
  title = {A Safe Bayesian Learning Algorithm for Constrained MDPs with Bounded Constraint Violation},
  author = {Kalagarla, Krishna C and Jain, Rahul and Nuzzo, Pierluigi},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {2782--2790},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/kalagarla25a/kalagarla25a.pdf},
  url = {https://proceedings.mlr.press/v258/kalagarla25a.html},
  abstract = {Constrained Markov decision process (CMDP) models are increasingly important in many applications with multiple objectives. When the model is unknown and must be learned online, it is desirable to ensure that the constraint is met, or at least that the violation remains bounded over time. Recent literature has made progress on this very challenging problem, but existing approaches either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe-PSRL (posterior sampling-based RL) algorithm, which does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm efficiently trades off exploration and exploitation using posterior sampling-based exploration, yet provably suffers only bounded constraint violation thanks to carefully crafted pessimism. We establish a sub-linear $\tilde{O}(H^{2.5}\sqrt{|S|^2|A|K})$ upper bound on the Bayesian objective regret, along with a bounded, i.e., $\tilde{O}(1)$, constraint-violation regret over $K$ episodes for an $|S|$-state, $|A|$-action, $H$-horizon CMDP, which improves over state-of-the-art algorithms for the same setting.}
}
Endnote
%0 Conference Paper
%T A Safe Bayesian Learning Algorithm for Constrained MDPs with Bounded Constraint Violation
%A Krishna C Kalagarla
%A Rahul Jain
%A Pierluigi Nuzzo
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-kalagarla25a
%I PMLR
%P 2782--2790
%U https://proceedings.mlr.press/v258/kalagarla25a.html
%V 258
%X Constrained Markov decision process (CMDP) models are increasingly important in many applications with multiple objectives. When the model is unknown and must be learned online, it is desirable to ensure that the constraint is met, or at least that the violation remains bounded over time. Recent literature has made progress on this very challenging problem, but existing approaches either rely on unsatisfactory assumptions, such as knowledge of a safe policy, or incur high cumulative regret. We propose the Safe-PSRL (posterior sampling-based RL) algorithm, which does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm efficiently trades off exploration and exploitation using posterior sampling-based exploration, yet provably suffers only bounded constraint violation thanks to carefully crafted pessimism. We establish a sub-linear $\tilde{O}(H^{2.5}\sqrt{|S|^2|A|K})$ upper bound on the Bayesian objective regret, along with a bounded, i.e., $\tilde{O}(1)$, constraint-violation regret over $K$ episodes for an $|S|$-state, $|A|$-action, $H$-horizon CMDP, which improves over state-of-the-art algorithms for the same setting.
APA
Kalagarla, K.C., Jain, R. & Nuzzo, P. (2025). A Safe Bayesian Learning Algorithm for Constrained MDPs with Bounded Constraint Violation. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:2782-2790. Available from https://proceedings.mlr.press/v258/kalagarla25a.html.