The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms

Anirudh Vemula, Yuda Song, Aarti Singh, Drew Bagnell, Sanjiban Choudhury
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:34978-35005, 2023.

Abstract

We propose a novel approach to addressing two fundamental challenges in Model-based Reinforcement Learning (MBRL): the computational expense of repeatedly finding a good policy in the learned model, and the objective mismatch between model fitting and policy computation. Our "lazy" method leverages a novel unified objective, Performance Difference via Advantage in Model, to capture the performance difference between the learned policy and expert policy under the true dynamics. This objective demonstrates that optimizing the expected policy advantage in the learned model under an exploration distribution is sufficient for policy computation, resulting in a significant boost in computational efficiency compared to traditional planning methods. Additionally, the unified objective uses a value moment matching term for model fitting, which is aligned with the model’s usage during policy computation. We present two no-regret algorithms to optimize the proposed objective, and demonstrate their statistical and computational gains compared to existing MBRL methods through simulated benchmarks.
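As a reading aid (our paraphrase, not the paper's exact statement), the decomposition described in the abstract has roughly the following shape, writing P for the true dynamics, \hat{M} for the learned model, \pi^e for the expert policy, and \nu for an exploration distribution; the precise lemma, constants, and distributions are given in the paper:

\[
\underbrace{J_P(\pi^e) - J_P(\pi)}_{\text{performance difference under true dynamics}}
\;\lesssim\;
\underbrace{\mathbb{E}_{(s,a)\sim\nu}\big[A^{\pi}_{\hat{M}}(s,a)\big]}_{\text{policy computation: advantage in the learned model}}
\;+\;
\underbrace{\mathbb{E}_{(s,a)\sim\nu}\Big|\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}V^{\pi}_{\hat{M}}(s') - \mathbb{E}_{s'\sim \hat{M}(\cdot\mid s,a)}V^{\pi}_{\hat{M}}(s')\,\Big|}_{\text{model fitting: value moment matching}}
\]

Read this way, the "lazy" aspect is that the policy only has to reduce its expected (dis)advantage against the expert inside the learned model under \nu, rather than be solved to near-optimality in that model, while the model-fitting term measures error exactly through the value function the policy uses, which is the alignment the abstract refers to.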

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-vemula23a,
  title     = {The Virtues of Laziness in Model-based {RL}: A Unified Objective and Algorithms},
  author    = {Vemula, Anirudh and Song, Yuda and Singh, Aarti and Bagnell, Drew and Choudhury, Sanjiban},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {34978--35005},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/vemula23a/vemula23a.pdf},
  url       = {https://proceedings.mlr.press/v202/vemula23a.html}
}
Endnote
%0 Conference Paper
%T The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms
%A Anirudh Vemula
%A Yuda Song
%A Aarti Singh
%A Drew Bagnell
%A Sanjiban Choudhury
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-vemula23a
%I PMLR
%P 34978--35005
%U https://proceedings.mlr.press/v202/vemula23a.html
%V 202
APA
Vemula, A., Song, Y., Singh, A., Bagnell, D. & Choudhury, S. (2023). The Virtues of Laziness in Model-based RL: A Unified Objective and Algorithms. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:34978-35005. Available from https://proceedings.mlr.press/v202/vemula23a.html.
