Understanding Policy Gradient Algorithms: A Sensitivity-Based Approach
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:24131-24149, 2022.
Abstract
The REINFORCE algorithm \cite{williams1992simple} is popular in policy gradient (PG) methods for solving reinforcement learning (RL) problems, while the theoretical form of PG is due to \cite{sutton1999policy}. Although both formulae prescribe PG, the precise connection between them has not been made explicit. Recently, \citeauthor{nota2020policy} (\citeyear{nota2020policy}) found that this ambiguity leads to implementation errors. Motivated by the ambiguity and the resulting incorrect implementations, we study PG from a perturbation perspective. In particular, we derive PG in a unified framework, precisely clarify the relation between PG implementation and theory, and echo the findings of \citeauthor{nota2020policy}. Examining the factors behind the empirical success of the existing erroneous implementations, we find that small approximation error and the experience replay mechanism play critical roles.
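For readers unfamiliar with the discrepancy the abstract refers to, the following standard formulas (a sketch drawn from \cite{sutton1999policy} and \cite{nota2020policy}, not reproduced from this paper's body) make it concrete. For the discounted objective $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\big[\sum_{t=0}^{\infty}\gamma^t r_t\big]$, the policy gradient theorem gives
\[
\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty}\gamma^t\,\nabla_\theta \log \pi_\theta(a_t\mid s_t)\, Q^{\pi_\theta}(s_t,a_t)\Big],
\]
whereas typical REINFORCE-style implementations follow the update
\[
\hat g = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_{t=0}^{\infty}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\, Q^{\pi_\theta}(s_t,a_t)\Big],
\]
which drops the $\gamma^t$ factor on the score term; \citeauthor{nota2020policy} show that this quantity is, in general, not the gradient of any objective.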