An Analytical Update Rule for General Policy Optimization

Hepeng Li, Nicholas Clavette, Haibo He
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:12696-12716, 2022.

Abstract

We present an analytical policy update rule that is independent of parametric function approximators. The update rule is suitable for optimizing general stochastic policies and carries a monotonic improvement guarantee. It is derived as a closed-form solution to trust-region optimization via the calculus of variations, building on a new theoretical result that tightens existing policy-improvement bounds for trust-region methods. The update rule establishes a connection between policy search methods and value function methods. Moreover, because it does not require computing integrals over on-policy states, off-policy reinforcement learning algorithms can be derived from it. Finally, the update rule extends immediately to cooperative multi-agent systems when policy updates are performed one agent at a time.
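For readers who want a concrete picture before opening the PDF: the trust-region problem underlying such analytical rules has the standard KL-constrained form below, and the KL-regularized (soft-constrained) relaxation of that problem admits a well-known closed-form maximizer of exponentiated-advantage type. The sketch uses conventional notation (advantage \(A^{\pi_{\text{old}}}\), state distribution \(d^{\pi_{\text{old}}}\), trust-region radius \(\delta\), temperature \(\lambda\)) and is an illustrative assumption on our part, not the paper's exact update rule or its tightened bound.

\[
\pi_{\text{new}} \;=\; \arg\max_{\pi}\; \mathbb{E}_{s \sim d^{\pi_{\text{old}}},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_{\text{old}}}(s,a) \big]
\quad \text{s.t.}\quad
\mathbb{E}_{s \sim d^{\pi_{\text{old}}}}\big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\big\|\, \pi_{\text{old}}(\cdot \mid s) \big) \big] \le \delta .
\]

Moving the constraint into the objective with a multiplier \(\lambda > 0\) and taking the variation of the Lagrangian over the density \(\pi(\cdot \mid s)\) at each state, subject to normalization, yields the stationarity condition and hence

\[
\pi_{\text{new}}(a \mid s) \;\propto\; \pi_{\text{old}}(a \mid s)\, \exp\!\Big( \tfrac{1}{\lambda}\, A^{\pi_{\text{old}}}(s,a) \Big),
\]

a per-state update that involves no policy parameters, which is the sense in which an analytical rule can be independent of parametric function approximators. As \(\lambda \to 0\) the update concentrates on the advantage-maximizing actions, recovering a greedy, value-function-style improvement step; this illustrates the kind of connection between policy search and value function methods the abstract refers to.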

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-li22d,
  title     = {An Analytical Update Rule for General Policy Optimization},
  author    = {Li, Hepeng and Clavette, Nicholas and He, Haibo},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {12696--12716},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/li22d/li22d.pdf},
  url       = {https://proceedings.mlr.press/v162/li22d.html}
}
APA
Li, H., Clavette, N., & He, H. (2022). An Analytical Update Rule for General Policy Optimization. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:12696-12716. Available from https://proceedings.mlr.press/v162/li22d.html.
