Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang; Chin Pang Ho; Marek Petrik

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:35763-35797, 2023.

Abstract

Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.
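The double-loop idea described in the abstract can be illustrated with a toy sketch. This is not the authors' DRPG implementation: it assumes a finite uncertainty set of transition kernels, finds the worst-case kernel by enumeration (standing in for the gradient-based inner solver with a parametric kernel), and uses a softmax policy with a fixed step size. All function names, the two-state example, and the hyperparameters are hypothetical.

```python
import numpy as np

def policy_value(pi, P, r, gamma, rho):
    """Return (J, v, d) for policy pi under kernel P.

    pi: (S, A) policy, P: (S, A, S) kernel, r: (S, A) rewards,
    rho: (S,) initial distribution. d is the unnormalized
    discounted state occupancy.
    """
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward per state
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    return rho @ v, v, d

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def robust_policy_gradient(kernels, r, gamma, rho, steps=500, lr=1.0):
    """Toy double loop: worst-case kernel first, then one policy gradient step."""
    theta = np.zeros(r.shape)
    for _ in range(steps):
        pi = softmax(theta)
        # Inner loop (assumption: enumeration over a finite set of kernels).
        returns = [policy_value(pi, P, r, gamma, rho)[0] for P in kernels]
        P_worst = kernels[int(np.argmin(returns))]
        # Outer loop: softmax policy gradient ascent under the worst-case kernel.
        _, v, d = policy_value(pi, P_worst, r, gamma, rho)
        q = r + gamma * (P_worst @ v)                    # (S, A) action values
        adv = q - (pi * q).sum(axis=1, keepdims=True)    # advantages
        theta += lr * d[:, None] * pi * adv
    return softmax(theta)

# Hypothetical two-state example: state 0 is the start, state 1 is absorbing.
# Action 0 ("safe") pays 1 and stays; action 1 ("risky") pays 2 but, under the
# adversarial kernel, falls into the zero-reward absorbing state.
P_nominal = np.zeros((2, 2, 2))
P_nominal[0, :, 0] = 1.0
P_nominal[1, :, 1] = 1.0
P_adverse = P_nominal.copy()
P_adverse[0, 1] = [0.0, 1.0]
kernels = [P_nominal, P_adverse]
r = np.array([[1.0, 2.0], [0.0, 0.0]])
gamma, rho = 0.9, np.array([1.0, 0.0])

pi = robust_policy_gradient(kernels, r, gamma, rho)
robust_return = min(policy_value(pi, P, r, gamma, rho)[0] for P in kernels)
print(pi[0], robust_return)
```

In this toy problem the risky action looks better under the nominal kernel, but the worst-case kernel makes the safe action robustly optimal, so the learned policy should concentrate on the safe action in state 0.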

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-wang23i,
  title = 	 {Policy Gradient in Robust {MDP}s with Global Convergence Guarantee},
  author =       {Wang, Qiuhao and Ho, Chin Pang and Petrik, Marek},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {35763--35797},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/wang23i/wang23i.pdf},
  url = 	 {https://proceedings.mlr.press/v202/wang23i.html},
  abstract = 	 {Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.}
}

Endnote

%0 Conference Paper
%T Policy Gradient in Robust MDPs with Global Convergence Guarantee
%A Qiuhao Wang
%A Chin Pang Ho
%A Marek Petrik
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-wang23i
%I PMLR
%P 35763--35797
%U https://proceedings.mlr.press/v202/wang23i.html
%V 202
%X Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.

APA


Wang, Q., Ho, C.P. & Petrik, M. (2023). Policy Gradient in Robust MDPs with Global Convergence Guarantee. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:35763-35797. Available from https://proceedings.mlr.press/v202/wang23i.html.
