Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Romain Laroche, Remi Tachet Des Combes
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:5658-5688, 2022.

Abstract

In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
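
To make the distinction concrete, here is a minimal sketch (not taken from the paper) comparing, on a single state with a tabular softmax policy, the update prescribed by the policy gradient theorem with a plain cross-entropy update toward the action maximizing $q$. The function names, learning rate, and toy numbers are illustrative assumptions; the paper's modified update, which guarantees the value does not decrease, is not reproduced here.

import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def pg_update(logits, q, lr=0.1):
    # Softmax policy-gradient update for a single state:
    # d logits[a] = lr * pi(a) * (q[a] - v), with v = sum_a pi(a) * q[a].
    # The pi(a) factor is the structural symmetry with the value target that
    # slows unlearning when the better action currently has near-zero probability.
    pi = softmax(logits)
    v = pi @ q
    return logits + lr * pi * (q - v)

def ce_update(logits, q, lr=0.1):
    # Cross-entropy update toward the greedy action a* = argmax_a q[a]:
    # gradient of -log pi(a*) w.r.t. the logits, i.e. one_hot(a*) - pi.
    # (Plain cross-entropy only; the paper modifies this update to rule out
    # decreases in value.)
    pi = softmax(logits)
    target = np.zeros_like(pi)
    target[np.argmax(q)] = 1.0
    return logits + lr * (target - pi)

# Toy case: the policy is confident in an action the critic now rates worse,
# so it must unlearn its previous preference.
logits_pg = np.array([5.0, 0.0])   # pi roughly [0.993, 0.007]
logits_ce = logits_pg.copy()
q = np.array([0.0, 1.0])           # action 1 is now the better one
for _ in range(200):
    logits_pg = pg_update(logits_pg, q)
    logits_ce = ce_update(logits_ce, q)
print("pi after policy-gradient updates:", softmax(logits_pg))
print("pi after cross-entropy updates:  ", softmax(logits_ce))

In this toy run, the cross-entropy update switches the policy to action 1 within a few dozen steps, while the policy-gradient update has barely moved after 200 steps; this is the unlearning slowness the abstract refers to.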

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-laroche22a,
  title     = {Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms},
  author    = {Laroche, Romain and Tachet Des Combes, Remi},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {5658--5688},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/laroche22a/laroche22a.pdf},
  url       = {https://proceedings.mlr.press/v151/laroche22a.html},
  abstract  = {In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.}
}
Endnote
%0 Conference Paper
%T Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
%A Romain Laroche
%A Remi Tachet Des Combes
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera
%F pmlr-v151-laroche22a
%I PMLR
%P 5658--5688
%U https://proceedings.mlr.press/v151/laroche22a.html
%V 151
%X In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
APA
Laroche, R., & Tachet Des Combes, R. (2022). Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:5658-5688. Available from https://proceedings.mlr.press/v151/laroche22a.html.