ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints

Akhil Agnihotri, Rahul Jain, Haipeng Luo
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:397-415, 2024.

Abstract

Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well in the average-CMDP setting. In this paper, we introduce a new policy optimization algorithm with function approximation for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust-region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance and, through extensive experiments in various challenging OpenAI Gym environments, show its superior empirical performance compared to other state-of-the-art algorithms adapted to the ACMDP setting.
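For context, the average-criterion constrained MDP problem that ACPO targets can be sketched in its standard form as below; the symbols r (reward), c (cost), d (cost budget), and δ (trust-region radius) are illustrative notation introduced here and need not match the paper's.

\max_{\pi}\; J_r(\pi) \;=\; \lim_{N\to\infty} \frac{1}{N}\,\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{N-1} r(s_t,a_t)\right]
\qquad \text{subject to} \qquad
J_c(\pi) \;=\; \lim_{N\to\infty} \frac{1}{N}\,\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{N-1} c(s_t,a_t)\right] \;\le\; d

In the trust-region spirit referenced in the abstract, each policy update is additionally restricted to a neighborhood of the current policy \pi_k, for example via a divergence constraint D(\pi, \pi_k) \le \delta, so that local surrogate estimates of J_r and J_c remain reliable.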

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-agnihotri24a,
  title     = {{ACPO}: A Policy Optimization Algorithm for Average {MDP}s with Constraints},
  author    = {Agnihotri, Akhil and Jain, Rahul and Luo, Haipeng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {397--415},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/agnihotri24a/agnihotri24a.pdf},
  url       = {https://proceedings.mlr.press/v235/agnihotri24a.html},
  abstract  = {Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging OpenAI Gym environments, show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.}
}
Endnote
%0 Conference Paper
%T ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints
%A Akhil Agnihotri
%A Rahul Jain
%A Haipeng Luo
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-agnihotri24a
%I PMLR
%P 397--415
%U https://proceedings.mlr.press/v235/agnihotri24a.html
%V 235
%X Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging OpenAI Gym environments, show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.
APA
Agnihotri, A., Jain, R. & Luo, H. (2024). ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:397-415. Available from https://proceedings.mlr.press/v235/agnihotri24a.html.