Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data

Zuxin Liu, Zijian Guo, Zhepeng Cen, Huan Zhang, Yihang Yao, Hanjiang Hu, Ding Zhao
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:21586-21610, 2023.

Abstract

Previous work demonstrates that the optimal safe reinforcement learning policy in a noise-free environment is vulnerable and could be unsafe under observational attacks. While adversarial training effectively improves robustness and safety, collecting samples by attacking the behavior agent online could be expensive or prohibitively dangerous in many applications. We propose the robuSt vAriational ofF-policy lEaRning (SAFER) approach, which only requires benign training data without attacking the agent. SAFER obtains an optimal non-parametric variational policy distribution via convex optimization and then uses it to improve the parameterized policy robustly via supervised learning. The two-stage policy optimization facilitates robust training, and extensive experiments on multiple robot platforms show the efficiency of SAFER in learning a robust and safe policy: achieving the same reward with far fewer constraint violations during training than on-policy baselines.
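
The abstract describes a two-stage scheme: first derive a non-parametric variational target distribution from off-policy data, then fit the parameterized policy to it with supervised learning. Below is a minimal, hypothetical PyTorch sketch of that general pattern, not the authors' implementation: the exponentiated-advantage weighting in stage 1 and the observation-noise smoothness penalty are assumptions standing in for the paper's convex-optimization step and robustness mechanism, and all names (GaussianPolicy, safer_style_update, temperature, eps) are illustrative.

    # Hedged sketch of a two-stage variational off-policy update (not the authors' code).
    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))

        def log_prob(self, obs, act):
            mean = self.net(obs)
            return torch.distributions.Normal(mean, self.log_std.exp()).log_prob(act).sum(-1)

    def safer_style_update(policy, optimizer, obs, act, advantage, temperature=1.0, eps=0.05):
        """One two-stage update on a batch of benign (non-attacked) off-policy data."""
        # Stage 1: non-parametric variational target over the replayed actions,
        # here approximated by exponentiated-advantage weights (an assumption).
        with torch.no_grad():
            weights = torch.softmax(advantage / temperature, dim=0)

        # Stage 2: weighted maximum-likelihood fit of the parameterized policy
        # (the supervised-learning step described in the abstract).
        nll = -(weights * policy.log_prob(obs, act)).sum()

        # Robustness surrogate (assumption): keep the policy output stable under
        # small observation perturbations, mimicking an observational attack budget.
        noisy_obs = obs + eps * torch.randn_like(obs)
        smooth = ((policy.net(noisy_obs) - policy.net(obs)) ** 2).mean()

        loss = nll + smooth
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

As a usage example, one might construct policy = GaussianPolicy(obs_dim, act_dim) and optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4), then call safer_style_update on batches drawn from a benign replay buffer.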

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-liu23l,
  title     = {Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data},
  author    = {Liu, Zuxin and Guo, Zijian and Cen, Zhepeng and Zhang, Huan and Yao, Yihang and Hu, Hanjiang and Zhao, Ding},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {21586--21610},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/liu23l/liu23l.pdf},
  url       = {https://proceedings.mlr.press/v202/liu23l.html},
  abstract  = {Previous work demonstrates that the optimal safe reinforcement learning policy in a noise-free environment is vulnerable and could be unsafe under observational attacks. While adversarial training effectively improves robustness and safety, collecting samples by attacking the behavior agent online could be expensive or prohibitively dangerous in many applications. We propose the robuSt vAriational ofF-policy lEaRning (SAFER) approach, which only requires benign training data without attacking the agent. SAFER obtains an optimal non-parametric variational policy distribution via convex optimization and then uses it to improve the parameterized policy robustly via supervised learning. The two-stage policy optimization facilitates robust training, and extensive experiments on multiple robot platforms show the efficiency of SAFER in learning a robust and safe policy: achieving the same reward with much fewer constraint violations during training than on-policy baselines.}
}
Endnote
%0 Conference Paper
%T Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data
%A Zuxin Liu
%A Zijian Guo
%A Zhepeng Cen
%A Huan Zhang
%A Yihang Yao
%A Hanjiang Hu
%A Ding Zhao
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-liu23l
%I PMLR
%P 21586--21610
%U https://proceedings.mlr.press/v202/liu23l.html
%V 202
%X Previous work demonstrates that the optimal safe reinforcement learning policy in a noise-free environment is vulnerable and could be unsafe under observational attacks. While adversarial training effectively improves robustness and safety, collecting samples by attacking the behavior agent online could be expensive or prohibitively dangerous in many applications. We propose the robuSt vAriational ofF-policy lEaRning (SAFER) approach, which only requires benign training data without attacking the agent. SAFER obtains an optimal non-parametric variational policy distribution via convex optimization and then uses it to improve the parameterized policy robustly via supervised learning. The two-stage policy optimization facilitates robust training, and extensive experiments on multiple robot platforms show the efficiency of SAFER in learning a robust and safe policy: achieving the same reward with much fewer constraint violations during training than on-policy baselines.
APA
Liu, Z., Guo, Z., Cen, Z., Zhang, H., Yao, Y., Hu, H., & Zhao, D. (2023). Towards Robust and Safe Reinforcement Learning with Benign Off-policy Data. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:21586-21610. Available from https://proceedings.mlr.press/v202/liu23l.html.
