ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

Andrew Jesson, Chris Lu, Gunshi Gupta, Nicolas Beltran-Velez, Angelos Filos, Jakob Nicolaus Foerster, Yarin Gal
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:21577-21605, 2024.

Abstract

This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
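To make the three modifications concrete, below is a minimal sketch of how they might be wired into a generic on-policy actor-critic, assuming a PyTorch-style setup. All module and variable names (Actor, Critic, advantages, select_action, ...) are illustrative assumptions, not the authors' released code.

```python
# Sketch of the three A3C modifications described in the abstract:
# (1) ReLU on advantage estimates, (2) spectral normalization of
# actor-critic weights, (3) dropout as approximate Bayesian inference,
# with Thompson sampling via dropout kept active at action time.
# All names here are illustrative, not the paper's implementation.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class Critic(nn.Module):
    def __init__(self, obs_dim, hidden=64, p_drop=0.1):
        super().__init__()
        # (2) Spectral normalization bounds the Lipschitz constant of the
        #     value function; (3) dropout gives an approximate posterior.
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(obs_dim, hidden)), nn.ReLU(), nn.Dropout(p_drop),
            spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(), nn.Dropout(p_drop),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(obs_dim, hidden)), nn.ReLU(), nn.Dropout(p_drop),
            spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(), nn.Dropout(p_drop),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        h = self.net(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())


def actor_loss(actor, obs, actions, advantages):
    # (1) ReLU on the advantage estimates: only transitions with positive
    #     estimated advantage contribute to the policy-gradient update.
    log_probs = actor.dist(obs).log_prob(actions).sum(-1)
    return -(torch.relu(advantages).detach() * log_probs).mean()


def select_action(actor, obs):
    # (3) Thompson sampling: keep dropout active when acting, so each action
    #     is drawn from a policy defined by sampled actor parameters.
    actor.train()
    with torch.no_grad():
        return actor.dist(obs).sample()
```

In this sketch, the ReLU gate means only positive-advantage transitions move the policy, and leaving dropout on during action selection is what turns the approximate posterior over actor parameters into state-aware exploration around the modes of the actor.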

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-jesson24a,
  title     = {{R}e{LU} to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages},
  author    = {Jesson, Andrew and Lu, Chris and Gupta, Gunshi and Beltran-Velez, Nicolas and Filos, Angelos and Foerster, Jakob Nicolaus and Gal, Yarin},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {21577--21605},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/jesson24a/jesson24a.pdf},
  url       = {https://proceedings.mlr.press/v235/jesson24a.html},
  abstract  = {This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.}
}
Endnote
%0 Conference Paper
%T ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages
%A Andrew Jesson
%A Chris Lu
%A Gunshi Gupta
%A Nicolas Beltran-Velez
%A Angelos Filos
%A Jakob Nicolaus Foerster
%A Yarin Gal
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-jesson24a
%I PMLR
%P 21577--21605
%U https://proceedings.mlr.press/v235/jesson24a.html
%V 235
%X This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
APA
Jesson, A., Lu, C., Gupta, G., Beltran-Velez, N., Filos, A., Foerster, J.N. & Gal, Y. (2024). ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:21577-21605. Available from https://proceedings.mlr.press/v235/jesson24a.html.