Robust Policy Gradient against Strong Data Corruption

Xuezhou Zhang; Yiding Chen; Xiaojin Zhu; Wen Sun

Robust Policy Gradient against Strong Data Corruption

Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, Wen Sun

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12391-12401, 2021.

Abstract

We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most $\epsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first that can achieve a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complimentary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-zhang21d,
  title = 	 {Robust Policy Gradient against Strong Data Corruption},
  author =       {Zhang, Xuezhou and Chen, Yiding and Zhu, Xiaojin and Sun, Wen},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {12391--12401},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/zhang21d/zhang21d.pdf},
  url = 	 {https://proceedings.mlr.press/v139/zhang21d.html},
  abstract = 	 {We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most $\epsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first that can achieve a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complimentary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.}
}

Endnote

%0 Conference Paper
%T Robust Policy Gradient against Strong Data Corruption
%A Xuezhou Zhang
%A Yiding Chen
%A Xiaojin Zhu
%A Wen Sun
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-zhang21d
%I PMLR
%P 12391--12401
%U https://proceedings.mlr.press/v139/zhang21d.html
%V 139
%X We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an \textit{adaptive} adversary who can arbitrarily corrupt the reward and transition at every step within an episode, for at most $\epsilon$-fraction of the learning episodes. Our attack model is strictly stronger than those considered in prior works. Our first result shows that no algorithm can find a better than $O(\epsilon)$-optimal policy under our attack model. Next, we show that surprisingly the natural policy gradient (NPG) method retains a natural robustness property if the reward corruption is bounded, and can find an $O(\sqrt{\epsilon})$-optimal policy. Consequently, we develop a Filtered Policy Gradient (FPG) algorithm that can tolerate even unbounded reward corruption and can find an $O(\epsilon^{1/4})$-optimal policy. We emphasize that FPG is the first that can achieve a meaningful learning guarantee when a constant fraction of episodes are corrupted. Complimentary to the theoretical results, we show that a neural implementation of FPG achieves strong robust learning performance on the MuJoCo continuous control benchmarks.

APA

Zhang, X., Chen, Y., Zhu, X. & Sun, W.. (2021). Robust Policy Gradient against Strong Data Corruption. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12391-12401 Available from https://proceedings.mlr.press/v139/zhang21d.html.

Robust Policy Gradient against Strong Data Corruption

Abstract

Cite this Paper

Related Material