On Proximal Policy Optimization’s Heavy-tailed Gradients

Saurabh Garg; Joshua Zhanson; Emilio Parisotto; Adarsh Prasad; Zico Kolter; Zachary Lipton; Sivaraman Balakrishnan; Ruslan Salakhutdinov; Pradeep Ravikumar

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:3610-3619, 2021.

Abstract

Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent’s policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.
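The abstract names GMOM (geometric median-of-means) as the robust estimator that replaces the clipping tricks. As a rough illustration of the general idea only, the sketch below splits samples into blocks, averages each block, and takes the geometric median of the block means using Weiszfeld's iteration. The function names, the equal-size block split, the iteration count, and the tolerance are all assumptions made for this example; the abstract does not describe how the authors apply the estimator to PPO gradients, so this should not be read as their implementation.

```python
import math


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def geometric_median(points, n_iter=200, tol=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration.

    n_iter and tol are illustrative defaults, not values from the paper.
    """
    median = mean_vector(points)  # start from the ordinary mean
    for _ in range(n_iter):
        # Each point is weighted by the inverse of its distance to the
        # current estimate; the floor avoids division by zero.
        weights = [1.0 / max(math.dist(p, median), 1e-12) for p in points]
        total = sum(weights)
        new_median = [
            sum(w * p[i] for w, p in zip(weights, points)) / total
            for i in range(len(median))
        ]
        if math.dist(new_median, median) < tol:
            return new_median
        median = new_median
    return median


def gmom(samples, n_blocks):
    """Geometric median of block means (hypothetical helper).

    Samples are split into n_blocks contiguous blocks of equal size;
    any leftover samples at the end are ignored in this sketch.
    """
    block_size = len(samples) // n_blocks
    blocks = [
        samples[k * block_size:(k + 1) * block_size] for k in range(n_blocks)
    ]
    return geometric_median([mean_vector(b) for b in blocks])
```

In a setting like the one the abstract describes, `samples` would presumably be per-sample gradient vectors, so that a handful of extreme gradients can corrupt only a few block means rather than the final estimate. That mapping is an assumption made here for illustration.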

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-garg21b,
  title = 	 {On Proximal Policy Optimization’s Heavy-tailed Gradients},
  author =       {Garg, Saurabh and Zhanson, Joshua and Parisotto, Emilio and Prasad, Adarsh and Kolter, Zico and Lipton, Zachary and Balakrishnan, Sivaraman and Salakhutdinov, Ruslan and Ravikumar, Pradeep},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {3610--3619},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/garg21b/garg21b.pdf},
  url = 	 {https://proceedings.mlr.press/v139/garg21b.html},
  abstract = 	 {Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent’s policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.}
}

Endnote

%0 Conference Paper
%T On Proximal Policy Optimization’s Heavy-tailed Gradients
%A Saurabh Garg
%A Joshua Zhanson
%A Emilio Parisotto
%A Adarsh Prasad
%A Zico Kolter
%A Zachary Lipton
%A Sivaraman Balakrishnan
%A Ruslan Salakhutdinov
%A Pradeep Ravikumar
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-garg21b
%I PMLR
%P 3610--3619
%U https://proceedings.mlr.press/v139/garg21b.html
%V 139
%X Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent’s policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks.

APA

Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R. & Ravikumar, P. (2021). On Proximal Policy Optimization’s Heavy-tailed Gradients. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:3610-3619. Available from https://proceedings.mlr.press/v139/garg21b.html.
