Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Tao Wang, Ruipeng Zhang, Sicun Gao
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:65449-65463, 2025.

Abstract

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
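For readers who want a concrete picture of the change the abstract describes, below is a minimal sketch (not the authors' code) of vanilla policy gradient in which the number of value-function update steps per iteration is exposed as a single hyperparameter. The environment, network sizes, learning rates, and the value of VALUE_UPDATE_STEPS are illustrative assumptions; the paper's experiments use the standard continuous control benchmarks rather than this toy discrete task.

# Minimal vanilla policy gradient sketch with a tunable number of value update
# steps per iteration. All choices below (CartPole, 64-unit nets, 40 value steps)
# are illustrative assumptions, not values from the paper.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_fn = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

VALUE_UPDATE_STEPS = 40   # the key knob: more steps -> more accurate value estimates
GAMMA = 0.99

for iteration in range(200):
    # --- collect one on-policy episode ---
    obs_list, act_list, rew_list = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs_t))
        action = dist.sample()
        next_obs, reward, terminated, truncated, _ = env.step(action.item())
        obs_list.append(obs_t)
        act_list.append(action)
        rew_list.append(reward)
        obs, done = next_obs, terminated or truncated

    # --- compute discounted Monte Carlo returns ---
    returns, g = [], 0.0
    for r in reversed(rew_list):
        g = r + GAMMA * g
        returns.insert(0, g)
    obs_batch = torch.stack(obs_list)
    act_batch = torch.stack(act_list)
    ret_batch = torch.as_tensor(returns, dtype=torch.float32)

    # --- one vanilla policy gradient step, using the value net as a baseline ---
    with torch.no_grad():
        advantages = ret_batch - value_fn(obs_batch).squeeze(-1)
    log_probs = torch.distributions.Categorical(logits=policy(obs_batch)).log_prob(act_batch)
    policy_loss = -(log_probs * advantages).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # --- many value update steps per iteration (the change studied in the paper) ---
    for _ in range(VALUE_UPDATE_STEPS):
        value_loss = ((value_fn(obs_batch).squeeze(-1) - ret_batch) ** 2).mean()
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()

With VALUE_UPDATE_STEPS set to 1 this reduces to the usual vanilla policy gradient baseline; the paper's claim is that raising this count alone, with no trust-region or clipping machinery, is enough to reach performance comparable to PPO while being more robust to hyperparameter choices.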

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-wang25em,
  title     = {Improving Value Estimation Critically Enhances Vanilla Policy Gradient},
  author    = {Wang, Tao and Zhang, Ruipeng and Gao, Sicun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {65449--65463},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25em/wang25em.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25em.html},
  abstract  = {Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.}
}
Endnote
%0 Conference Paper
%T Improving Value Estimation Critically Enhances Vanilla Policy Gradient
%A Tao Wang
%A Ruipeng Zhang
%A Sicun Gao
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25em
%I PMLR
%P 65449--65463
%U https://proceedings.mlr.press/v267/wang25em.html
%V 267
%X Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.
APA
Wang, T., Zhang, R., & Gao, S. (2025). Improving Value Estimation Critically Enhances Vanilla Policy Gradient. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:65449-65463. Available from https://proceedings.mlr.press/v267/wang25em.html.
