Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang, Chin Pang Ho, Marek Petrik
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:35763-35797, 2023.

Abstract

Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.
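The double-loop structure mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' DRPG implementation, and it rests on several assumptions that do not come from the abstract: a reward-maximizing formulation, a softmax policy, and an uncertainty set given as a short finite list of candidate transition kernels. In particular, the inner loop below finds the worst-case kernel by plain enumeration, standing in for the gradient-based inner solver over a parametric kernel that the paper describes. All function names are hypothetical.

```python
import numpy as np

def evaluate(pi, P, r, gamma, rho):
    """Exact return, state values, and discounted occupancy of pi under kernel P."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)  # unnormalized occupancy
    return rho @ v, v, d

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def worst_case(pi, kernels, r, gamma, rho):
    """Inner loop (enumeration stand-in): the kernel giving pi the lowest return."""
    results = [evaluate(pi, P, r, gamma, rho) for P in kernels]
    k = int(np.argmin([res[0] for res in results]))
    return kernels[k], results[k]

def double_loop_pg(kernels, r, gamma, rho, steps=200, lr=1.0):
    """Outer loop: softmax policy-gradient ascent against the current worst kernel."""
    theta = np.zeros(r.shape)
    best_pi, best_val = None, -np.inf
    for _ in range(steps):
        pi = softmax(theta)
        P, (val, v, d) = worst_case(pi, kernels, r, gamma, rho)
        if val > best_val:                   # keep the best robust iterate seen so far
            best_pi, best_val = pi, val
        q = r + gamma * (P @ v)              # action values under the worst kernel
        adv = q - v[:, None]
        theta = theta + lr * d[:, None] * pi * adv  # exact softmax policy gradient
    return best_pi, best_val
```

Calling `double_loop_pg(kernels, r, gamma, rho)` with `kernels` as a list of arrays of shape `(S, A, S)`, `r` of shape `(S, A)`, and `rho` an initial state distribution returns the best policy found together with its worst-case return. The sketch keeps the best iterate rather than the last one because a plain subgradient step against a switching worst-case kernel is not guaranteed to improve monotonically.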

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-wang23i,
  title     = {Policy Gradient in Robust {MDP}s with Global Convergence Guarantee},
  author    = {Wang, Qiuhao and Ho, Chin Pang and Petrik, Marek},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {35763--35797},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/wang23i/wang23i.pdf},
  url       = {https://proceedings.mlr.press/v202/wang23i.html},
  abstract  = {Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.}
}
Endnote
%0 Conference Paper
%T Policy Gradient in Robust MDPs with Global Convergence Guarantee
%A Qiuhao Wang
%A Chin Pang Ho
%A Marek Petrik
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-wang23i
%I PMLR
%P 35763--35797
%U https://proceedings.mlr.press/v202/wang23i.html
%V 202
%X Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.
APA
Wang, Q., Ho, C.P. & Petrik, M. (2023). Policy Gradient in Robust MDPs with Global Convergence Guarantee. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:35763-35797. Available from https://proceedings.mlr.press/v202/wang23i.html.