Monotonic Robust Policy Optimization with Model Discrepancy

Yuankun Jiang, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4951-4960, 2021.

Abstract

State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit due to the model discrepancy between source and target environments. Although applying domain randomization during training can improve average performance by randomly generating a sufficiently diverse set of environments in the simulator, the worst-case environment is still neglected and carries no performance guarantee. Since both average and worst-case performance matter for generalization in RL, in this paper we propose a policy optimization approach that concurrently improves the policy's performance in the average and the worst-case environment. We theoretically derive a lower bound on the worst-case performance of a given policy by relating it to the expected performance. Guided by this lower bound, we formulate an optimization problem that jointly optimizes the policy and the sampling distribution, and prove that iteratively solving it monotonically improves the worst-case performance. We then develop a practical algorithm, named monotonic robust policy optimization (MRPO). Experimental evaluations on several robot control tasks demonstrate that MRPO generally improves both the average and worst-case performance in the source environments used for training, and in all cases equips the learned policy with better generalization to unseen testing environments.
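The abstract describes a training procedure that jointly adapts the policy and the sampling distribution over environment parameters so that both average and worst-case returns improve. The paper's exact updates (the derived lower bound and how it is optimized with monotonic-improvement guarantees) are not reproduced here; the following is a minimal, hypothetical Python sketch of what such a loop could look like. The names make_env, policy, and rollout_fn are assumed user-supplied stand-ins, and the reweighting and distribution updates are simplified illustrations rather than MRPO itself.

import numpy as np

def worst_case_aware_training_loop(
    make_env,              # hypothetical: builds an environment from a parameter vector
    policy,                # hypothetical policy object exposing .update(trajectories, weights)
    rollout_fn,            # hypothetical: rollout_fn(policy, env) -> (trajectory, return)
    param_low, param_high, # per-dimension bounds of the randomized environment parameters
    n_iterations=100,
    n_envs_per_iter=16,
    worst_fraction=0.25,   # fraction of sampled environments treated as "worst-case"
    seed=0,
):
    """Hedged sketch in the spirit of the abstract: sample environments, emphasize the
    worst-performing ones in the policy update, and shift the sampling distribution
    toward harder parameter regions. Not the MRPO algorithm from the paper."""
    rng = np.random.default_rng(seed)
    param_low = np.asarray(param_low, dtype=float)
    param_high = np.asarray(param_high, dtype=float)
    # Start from a uniform sampling distribution over the parameter box,
    # represented here by per-dimension Beta concentration parameters.
    alpha = np.ones(len(param_low))
    beta = np.ones(len(param_low))

    for it in range(n_iterations):
        # 1) Sample a batch of environment parameters from the current distribution.
        u = rng.beta(alpha, beta, size=(n_envs_per_iter, len(param_low)))
        params = param_low + u * (param_high - param_low)

        # 2) Roll out the current policy in each sampled environment.
        returns, trajectories = [], []
        for p in params:
            traj, ret = rollout_fn(policy, make_env(p))
            trajectories.append(traj)
            returns.append(ret)
        returns = np.asarray(returns)

        # 3) Upweight the worst-performing environments in the policy update so the
        #    worst case is not neglected (the paper instead optimizes a derived
        #    lower bound on worst-case performance).
        k = max(1, int(worst_fraction * n_envs_per_iter))
        worst_idx = np.argsort(returns)[:k]
        weights = np.ones(n_envs_per_iter)
        weights[worst_idx] += 1.0
        policy.update(trajectories, weights / weights.sum())

        # 4) Nudge the sampling distribution toward the harder parameter regions,
        #    a crude stand-in for the joint policy/distribution optimization.
        hard_u = u[worst_idx].mean(axis=0)
        alpha = 0.9 * alpha + 0.1 * (1.0 + 4.0 * hard_u)
        beta = 0.9 * beta + 0.1 * (1.0 + 4.0 * (1.0 - hard_u))

        print(f"iter {it:3d}  avg return {returns.mean():8.2f}  "
              f"worst return {returns.min():8.2f}")

The key design choice this sketch tries to mirror is that the policy update and the environment-sampling distribution are adapted together, rather than fixing a uniform domain-randomization distribution throughout training.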

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-jiang21c,
  title     = {Monotonic Robust Policy Optimization with Model Discrepancy},
  author    = {Jiang, Yuankun and Li, Chenglin and Dai, Wenrui and Zou, Junni and Xiong, Hongkai},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4951--4960},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/jiang21c/jiang21c.pdf},
  url       = {https://proceedings.mlr.press/v139/jiang21c.html},
  abstract  = {State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit due to the model discrepancy between source and target environments. Though applying domain randomization during training can improve the average performance by randomly generating a sufficient diversity of environments in simulator, the worst-case environment is still neglected without any performance guarantee. Since the average and worst-case performance are both important for generalization in RL, in this paper, we propose a policy optimization approach for concurrently improving the policy’s performance in the average and worst-case environment. We theoretically derive a lower bound for the worst-case performance of a given policy by relating it to the expected performance. Guided by this lower bound, we formulate an optimization problem to jointly optimize the policy and sampling distribution, and prove that by iteratively solving it the worst-case performance is monotonically improved. We then develop a practical algorithm, named monotonic robust policy optimization (MRPO). Experimental evaluations in several robot control tasks demonstrate that MRPO can generally improve both the average and worst-case performance in the source environments for training, and facilitate in all cases the learned policy with a better generalization capability in some unseen testing environments.}
}
Endnote
%0 Conference Paper
%T Monotonic Robust Policy Optimization with Model Discrepancy
%A Yuankun Jiang
%A Chenglin Li
%A Wenrui Dai
%A Junni Zou
%A Hongkai Xiong
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-jiang21c
%I PMLR
%P 4951--4960
%U https://proceedings.mlr.press/v139/jiang21c.html
%V 139
%X State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit due to the model discrepancy between source and target environments. Though applying domain randomization during training can improve the average performance by randomly generating a sufficient diversity of environments in simulator, the worst-case environment is still neglected without any performance guarantee. Since the average and worst-case performance are both important for generalization in RL, in this paper, we propose a policy optimization approach for concurrently improving the policy’s performance in the average and worst-case environment. We theoretically derive a lower bound for the worst-case performance of a given policy by relating it to the expected performance. Guided by this lower bound, we formulate an optimization problem to jointly optimize the policy and sampling distribution, and prove that by iteratively solving it the worst-case performance is monotonically improved. We then develop a practical algorithm, named monotonic robust policy optimization (MRPO). Experimental evaluations in several robot control tasks demonstrate that MRPO can generally improve both the average and worst-case performance in the source environments for training, and facilitate in all cases the learned policy with a better generalization capability in some unseen testing environments.
APA
Jiang, Y., Li, C., Dai, W., Zou, J. & Xiong, H. (2021). Monotonic Robust Policy Optimization with Model Discrepancy. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4951-4960. Available from https://proceedings.mlr.press/v139/jiang21c.html.