Unified Policy Optimization for Robust Reinforcement Learning

Zichuan Lin, Li Zhao, Jiang Bian, Tao Qin, Guangwen Yang
Proceedings of The Eleventh Asian Conference on Machine Learning, PMLR 101:395-410, 2019.

Abstract

Recent years have witnessed significant progress in solving challenging problems across various domains using deep reinforcement learning (RL). Despite this success, weak robustness has emerged as a major obstacle to applying existing RL algorithms to real-world problems. In this paper, we propose unified policy optimization (UPO), a sample-efficient shared-policy framework that allows a policy to update itself by considering the different gradients generated by different policy gradient (PG) methods. Specifically, we propose two algorithms, UPO-MAB and UPO-ES, which combine these gradients using ideas from multi-armed bandits (MAB) and evolution strategies (ES), respectively, with the goal of finding the gradient direction that yields larger performance gains at lower additional data cost. Extensive experiments show that our approach achieves stronger robustness and better performance than the baselines.
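To make the abstract's core idea concrete, the sketch below illustrates one way a policy could choose among gradients produced by different PG methods using a bandit: each candidate gradient is treated as an arm, and the observed performance gain after an update serves as the arm's reward. This is a hypothetical, simplified illustration, not the authors' UPO-MAB implementation: the EXP3-style weighting, the toy quadratic objective, the stand-in noisy gradients, and all names (exp3_probs, candidate_gradients, the learning rate) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def exp3_probs(score, gamma=0.1):
    # EXP3-style arm probabilities from cumulative importance-weighted rewards
    w = np.exp(score - score.max())               # shift for numerical stability
    return (1.0 - gamma) * w / w.sum() + gamma / len(w)

def candidate_gradients(theta):
    # Stand-ins for gradients that different PG methods might produce; here they
    # are noisy versions of the true gradient of the toy objective -||theta||^2.
    true_grad = -2.0 * theta
    return [
        true_grad + rng.normal(0.0, 0.1, theta.shape),  # low-noise estimator
        true_grad + rng.normal(0.0, 1.0, theta.shape),  # high-noise estimator
        rng.normal(0.0, 1.0, theta.shape),              # uninformative estimator
    ]

def performance(theta):
    return -float(theta @ theta)                  # toy "return" to be maximized

theta = rng.normal(size=5)
score = np.zeros(3)                               # one cumulative reward per arm
lr, gamma = 0.05, 0.1

for _ in range(200):
    grads = candidate_gradients(theta)
    probs = exp3_probs(score, gamma)
    arm = rng.choice(len(grads), p=probs)         # pick which gradient to follow
    before = performance(theta)
    theta = theta + lr * grads[arm]               # ascend along the chosen gradient
    gain = performance(theta) - before            # reward = observed performance gain
    score[arm] += gain / probs[arm]               # importance-weighted EXP3 update

print("final performance:", round(performance(theta), 3))
print("arm probabilities:", np.round(exp3_probs(score, gamma), 3))

Under these assumptions, the arm corresponding to the low-noise gradient accumulates the largest reward and is selected most often, which is the intended intuition: the bandit steers updates toward the gradient direction that empirically improves performance the most.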

Cite this Paper


BibTeX
@InProceedings{pmlr-v101-lin19a,
  title     = {Unified Policy Optimization for Robust Reinforcement Learning},
  author    = {Lin, Zichuan and Zhao, Li and Bian, Jiang and Qin, Tao and Yang, Guangwen},
  booktitle = {Proceedings of The Eleventh Asian Conference on Machine Learning},
  pages     = {395--410},
  year      = {2019},
  editor    = {Lee, Wee Sun and Suzuki, Taiji},
  volume    = {101},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Nov},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v101/lin19a/lin19a.pdf},
  url       = {https://proceedings.mlr.press/v101/lin19a.html}
}
Endnote
%0 Conference Paper
%T Unified Policy Optimization for Robust Reinforcement Learning
%A Zichuan Lin
%A Li Zhao
%A Jiang Bian
%A Tao Qin
%A Guangwen Yang
%B Proceedings of The Eleventh Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Wee Sun Lee
%E Taiji Suzuki
%F pmlr-v101-lin19a
%I PMLR
%P 395--410
%U https://proceedings.mlr.press/v101/lin19a.html
%V 101
APA
Lin, Z., Zhao, L., Bian, J., Qin, T., & Yang, G. (2019). Unified Policy Optimization for Robust Reinforcement Learning. Proceedings of The Eleventh Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 101:395-410. Available from https://proceedings.mlr.press/v101/lin19a.html.
