Learning Policy-Aware Models for Model-Based Reinforcement Learning via Transition Occupancy Matching

Yecheng Jason Ma, Kausik Sivakumar, Jason Yan, Osbert Bastani, Dinesh Jayaraman
Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:259-271, 2023.

Abstract

Standard model-based reinforcement learning (MBRL) approaches fit a transition model of the environment to all past experience, but this wastes model capacity on data that is irrelevant for policy improvement. We instead propose a new “transition occupancy matching” (TOM) objective for MBRL model learning: a model is good to the extent that the current policy experiences the same distribution of transitions inside the model as in the real environment. We derive TOM directly from a novel lower bound on the standard reinforcement learning objective. To optimize TOM, we show how to reduce it to a form of importance weighted maximum-likelihood estimation, where the automatically computed importance weights identify policy-relevant past experiences from a replay buffer, enabling stable optimization. TOM thus offers a plug-and-play model learning sub-routine that is compatible with any backbone MBRL algorithm. On various Mujoco continuous robotic control tasks, we show that TOM successfully focuses model learning on policy-relevant experience and drives policies faster to higher task rewards than alternative model learning approaches. The full paper and code can be found on our project website: https://penn-pal-lab.github.io/TOM/
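To make the abstract's reduction concrete: TOM's model update becomes an importance-weighted maximum-likelihood step over replay-buffer transitions, with weights that up-weight transitions the current policy is likely to encounter. The sketch below is a minimal illustration under assumed simplifications, not the authors' released implementation: it uses a unit-variance Gaussian dynamics model, and it takes the per-transition importance weights as a given input (their computation from the replay buffer is derived in the paper and provided in the project code). The names DynamicsModel and weighted_mle_step are illustrative, not from the paper.

# Minimal sketch (assumptions noted above) of importance-weighted MLE model learning.
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Gaussian next-state predictor p(s' | s, a) with fixed unit variance."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def log_prob(self, s, a, s_next):
        # Log-likelihood of s' under a unit-variance Gaussian (up to a constant).
        mean = self.net(torch.cat([s, a], dim=-1))
        return -0.5 * ((s_next - mean) ** 2).sum(dim=-1)

def weighted_mle_step(model, optimizer, batch, weights):
    """One gradient step on the importance-weighted MLE objective:
    maximize E_buffer[ w(s, a, s') * log p_model(s' | s, a) ],
    where w identifies policy-relevant transitions in the replay buffer."""
    s, a, s_next = batch
    loss = -(weights * model.log_prob(s, a, s_next)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Setting all weights to 1 recovers standard MBRL model fitting on the full buffer; TOM's weights instead concentrate model capacity on the transitions that matter for improving the current policy.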

Cite this Paper


BibTeX
@InProceedings{pmlr-v211-ma23a,
  title     = {Learning Policy-Aware Models for Model-Based Reinforcement Learning via Transition Occupancy Matching},
  author    = {Ma, Yecheng Jason and Sivakumar, Kausik and Yan, Jason and Bastani, Osbert and Jayaraman, Dinesh},
  booktitle = {Proceedings of The 5th Annual Learning for Dynamics and Control Conference},
  pages     = {259--271},
  year      = {2023},
  editor    = {Matni, Nikolai and Morari, Manfred and Pappas, George J.},
  volume    = {211},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--16 Jun},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v211/ma23a/ma23a.pdf},
  url       = {https://proceedings.mlr.press/v211/ma23a.html},
  abstract  = {Standard model-based reinforcement learning (MBRL) approaches fit a transition model of the environment to all past experience, but this wastes model capacity on data that is irrelevant for policy improvement. We instead propose a new “transition occupancy matching” (TOM) objective for MBRL model learning: a model is good to the extent that the current policy experiences the same distribution of transitions inside the model as in the real environment. We derive TOM directly from a novel lower bound on the standard reinforcement learning objective. To optimize TOM, we show how to reduce it to a form of importance weighted maximum-likelihood estimation, where the automatically computed importance weights identify policy-relevant past experiences from a replay buffer, enabling stable optimization. TOM thus offers a plug-and-play model learning sub-routine that is compatible with any backbone MBRL algorithm. On various Mujoco continuous robotic control tasks, we show that TOM successfully focuses model learning on policy-relevant experience and drives policies faster to higher task rewards than alternative model learning approaches. The full paper and code can be found on our project website: https://penn-pal-lab.github.io/TOM/}
}
Endnote
%0 Conference Paper
%T Learning Policy-Aware Models for Model-Based Reinforcement Learning via Transition Occupancy Matching
%A Yecheng Jason Ma
%A Kausik Sivakumar
%A Jason Yan
%A Osbert Bastani
%A Dinesh Jayaraman
%B Proceedings of The 5th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2023
%E Nikolai Matni
%E Manfred Morari
%E George J. Pappas
%F pmlr-v211-ma23a
%I PMLR
%P 259--271
%U https://proceedings.mlr.press/v211/ma23a.html
%V 211
%X Standard model-based reinforcement learning (MBRL) approaches fit a transition model of the environment to all past experience, but this wastes model capacity on data that is irrelevant for policy improvement. We instead propose a new “transition occupancy matching” (TOM) objective for MBRL model learning: a model is good to the extent that the current policy experiences the same distribution of transitions inside the model as in the real environment. We derive TOM directly from a novel lower bound on the standard reinforcement learning objective. To optimize TOM, we show how to reduce it to a form of importance weighted maximum-likelihood estimation, where the automatically computed importance weights identify policy-relevant past experiences from a replay buffer, enabling stable optimization. TOM thus offers a plug-and-play model learning sub-routine that is compatible with any backbone MBRL algorithm. On various Mujoco continuous robotic control tasks, we show that TOM successfully focuses model learning on policy-relevant experience and drives policies faster to higher task rewards than alternative model learning approaches. The full paper and code can be found on our project website: https://penn-pal-lab.github.io/TOM/
APA
Ma, Y.J., Sivakumar, K., Yan, J., Bastani, O. & Jayaraman, D. (2023). Learning Policy-Aware Models for Model-Based Reinforcement Learning via Transition Occupancy Matching. Proceedings of The 5th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 211:259-271. Available from https://proceedings.mlr.press/v211/ma23a.html.