Reflective Policy Optimization

Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:14471-14490, 2024.

Abstract

On-policy reinforcement learning methods, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that combines past and future state-action information for policy optimization. This approach enables the agent to introspect and modify its actions in the current state. Theoretical analysis confirms that RPO monotonically improves policy performance and contracts the solution space, thereby expediting convergence. Empirical results on two reinforcement learning benchmarks demonstrate RPO's feasibility and efficacy, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.
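
The abstract describes RPO as a PPO-style extension whose surrogate also incorporates the subsequent transition. The sketch below is a hypothetical illustration of that idea, not the paper's exact objective: it augments a standard PPO clipped surrogate with a term in which the product of consecutive importance ratios weights the next step's advantage. All names (`rpo_style_loss`, `ratio_next`, `adv_next`) are illustrative assumptions; see the paper and the linked repository for the actual formulation.

```python
# Hypothetical sketch (not the official RPO objective): a PPO-style clipped
# surrogate extended with a term that couples the current transition's ratio
# with the next transition's ratio, so that "future" state-action information
# feeds back into the update of the current action.
import torch

def rpo_style_loss(ratio_t, ratio_next, adv_t, adv_next, clip_eps=0.2):
    """All tensors have shape [batch].

    ratio_t, adv_t       : importance ratio pi_new/pi_old and advantage at step t
    ratio_next, adv_next : the same quantities at step t+1
    """
    # Standard PPO clipped surrogate for the current transition.
    ppo_term = torch.min(
        ratio_t * adv_t,
        torch.clamp(ratio_t, 1 - clip_eps, 1 + clip_eps) * adv_t,
    )
    # Reflective term: the joint ratio over two consecutive transitions
    # weights the next step's advantage, letting the outcome of the
    # following state-action pair modify the current update.
    joint_ratio = ratio_t * ratio_next
    refl_term = torch.min(
        joint_ratio * adv_next,
        torch.clamp(joint_ratio, 1 - clip_eps, 1 + clip_eps) * adv_next,
    )
    # Both surrogates are maximized, so return the negative mean as a loss.
    return -(ppo_term + refl_term).mean()
```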

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-gan24b,
  title     = {Reflective Policy Optimization},
  author    = {Gan, Yaozhong and Yan, Renye and Wu, Zhe and Xing, Junliang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {14471--14490},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/gan24b/gan24b.pdf},
  url       = {https://proceedings.mlr.press/v235/gan24b.html},
  abstract  = {On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future state-action information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO’s feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.}
}
Endnote
%0 Conference Paper
%T Reflective Policy Optimization
%A Yaozhong Gan
%A Renye Yan
%A Zhe Wu
%A Junliang Xing
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-gan24b
%I PMLR
%P 14471--14490
%U https://proceedings.mlr.press/v235/gan24b.html
%V 235
%X On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future state-action information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO’s feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.
APA
Gan, Y., Yan, R., Wu, Z. & Xing, J. (2024). Reflective Policy Optimization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:14471-14490. Available from https://proceedings.mlr.press/v235/gan24b.html.