Supported Trust Region Optimization for Offline Reinforcement Learning

Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:23829-23851, 2023.

Abstract

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR), which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further, with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and the much more challenging AntMaze domains.
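The mechanism sketched above, trust region policy optimization restricted to actions inside the support of the behavior policy, can be illustrated with a short example. The snippet below is a minimal, hypothetical sketch in PyTorch, assuming a Gaussian policy, separate Q and value networks, and an advantage-weighted, importance-ratio update computed only on state-action pairs drawn from the offline dataset; the names (GaussianPolicy, support_constrained_step, q_net, v_net) and the exact weighting are illustrative assumptions, not the paper's STR objective.

# Illustrative sketch, NOT the paper's exact STR objective: one
# support-constrained, trust-region-style policy improvement step.
# The policy is trained only on actions taken from the offline dataset,
# so neither the policy nor the critic is queried on out-of-support actions.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        mean = self.net(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.log_prob(act).sum(-1)

def support_constrained_step(policy, prev_policy, q_net, v_net, batch, optim):
    obs, act = batch["obs"], batch["act"]  # states and actions from the offline dataset
    with torch.no_grad():
        # Advantage of the dataset action under the current critic.
        adv = q_net(torch.cat([obs, act], -1)).squeeze(-1) - v_net(obs).squeeze(-1)
        # The previous policy acts as the trust-region anchor.
        prev_logp = prev_policy.log_prob(obs, act)
    logp = policy.log_prob(obs, act)
    ratio = torch.exp(logp - prev_logp)  # importance ratio to the previous policy
    # Weighted likelihood over in-dataset actions only: probability mass moves
    # toward high-advantage actions while staying within the behavior support.
    loss = -(ratio.detach() * adv.clamp(min=0.0) * logp).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

if __name__ == "__main__":
    obs_dim, act_dim, n = 4, 2, 32
    policy = GaussianPolicy(obs_dim, act_dim)
    prev_policy = GaussianPolicy(obs_dim, act_dim)
    prev_policy.load_state_dict(policy.state_dict())
    q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    v_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    batch = {"obs": torch.randn(n, obs_dim), "act": torch.randn(n, act_dim)}
    optim = torch.optim.Adam(policy.parameters(), lr=3e-4)
    print(support_constrained_step(policy, prev_policy, q_net, v_net, batch, optim))

Because the loss only queries log-probabilities and Q-values at dataset actions, this step never evaluates out-of-support actions, which is the property the support constraint is meant to preserve; the paper's actual objective and guarantees are given in the full text.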

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-mao23c,
  title     = {Supported Trust Region Optimization for Offline Reinforcement Learning},
  author    = {Mao, Yixiu and Zhang, Hongchang and Chen, Chen and Xu, Yi and Ji, Xiangyang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {23829--23851},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/mao23c/mao23c.pdf},
  url       = {https://proceedings.mlr.press/v202/mao23c.html},
  abstract  = {Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.}
}
APA
Mao, Y., Zhang, H., Chen, C., Xu, Y. & Ji, X. (2023). Supported Trust Region Optimization for Offline Reinforcement Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:23829-23851. Available from https://proceedings.mlr.press/v202/mao23c.html.
