Policy-conditioned Environment Models are More Generalizable

Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, Yang Yu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:6539-6561, 2024.

Abstract

In reinforcement learning, an accurate environment dynamics model is crucial for evaluating different policies’ values in downstream tasks such as offline policy optimization and policy evaluation. However, the learned model is known to make inaccurate predictions when evaluating target policies that differ from the data-collection policies. In this work, we find that using a policy representation for model learning, called policy-conditioned model (PCM) learning, helps mitigate this problem, especially when the offline dataset is collected from diverse behavior policies. The reason is that, in this case, the PCM becomes a meta-dynamics model that is trained to be aware of and focus on the evaluation policy, adjusting itself on the fly to suit the evaluation policy’s state-action distribution and thus improving prediction accuracy. Based on this intuition, we propose an easy-to-implement yet effective PCM algorithm for accurate model learning. We also provide a theoretical analysis and experimental evidence demonstrating the feasibility of reducing value gaps by adapting the dynamics model to different policies. Experimental results show that PCM outperforms existing state-of-the-art off-policy evaluation methods on the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control than the standard model-learning approach.
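To make the idea concrete, below is a minimal PyTorch sketch of a policy-conditioned dynamics model as described in the abstract: a policy is summarized into an embedding (here, assumed to be computed from a window of its recent state-action pairs), and the dynamics network is conditioned on that embedding so its next-state predictions adapt to the policy being evaluated. This is an illustrative reconstruction, not the authors' implementation; the class names (PolicyEncoder, PolicyConditionedDynamics) and the mean-pooled encoder are assumptions for exposition.

import torch
import torch.nn as nn


class PolicyEncoder(nn.Module):
    """Maps a batch of (state, action) pairs drawn from one policy to a
    single policy embedding by mean-pooling per-transition features."""

    def __init__(self, state_dim, action_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, states, actions):            # (N, s_dim), (N, a_dim)
        feats = self.net(torch.cat([states, actions], dim=-1))
        return feats.mean(dim=0)                   # (embed_dim,)


class PolicyConditionedDynamics(nn.Module):
    """Predicts the next state from (state, action, policy embedding)."""

    def __init__(self, state_dim, action_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action, policy_embed):
        # Broadcast the policy embedding across the batch of transitions.
        z = policy_embed.expand(state.shape[0], -1)
        return self.net(torch.cat([state, action, z], dim=-1))


if __name__ == "__main__":
    s_dim, a_dim = 11, 3
    encoder = PolicyEncoder(s_dim, a_dim)
    model = PolicyConditionedDynamics(s_dim, a_dim)

    # Transitions sampled from one behavior policy in the offline dataset
    # (random placeholders here).
    states, actions = torch.randn(256, s_dim), torch.randn(256, a_dim)
    next_states = torch.randn(256, s_dim)

    z = encoder(states, actions)                   # policy representation
    loss = nn.functional.mse_loss(model(states, actions, z), next_states)
    loss.backward()                                # one joint training step
    print(loss.item())

At evaluation time, the same encoder would be applied to rollouts (or samples) from the target policy, so the dynamics model is queried with an embedding of the policy actually being evaluated rather than of the behavior policies.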

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-chen24g,
  title     = {Policy-conditioned Environment Models are More Generalizable},
  author    = {Chen, Ruifeng and Chen, Xiong-Hui and Sun, Yihao and Xiao, Siyuan and Li, Minhui and Yu, Yang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {6539--6561},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chen24g/chen24g.pdf},
  url       = {https://proceedings.mlr.press/v235/chen24g.html},
  abstract  = {In reinforcement learning, it is crucial to have an accurate environment dynamics model to evaluate different policies’ value in downstream tasks like offline policy optimization and policy evaluation. However, the learned model is known to be inaccurate in predictions when evaluating target policies different from data-collection policies. In this work, we found that utilizing policy representation for model learning, called policy-conditioned model (PCM) learning, is useful to mitigate the problem, especially when the offline dataset is collected from diversified behavior policies. The reason beyond that is in this case, PCM becomes a meta-dynamics model that is trained to be aware of and focus on the evaluation policies that on-the-fly adjust the model to be suitable to the evaluation policies’ state-action distribution, thus improving the prediction accuracy. Based on that intuition, we propose an easy-to-implement yet effective algorithm of PCM for accurate model learning. We also give a theoretical analysis and experimental evidence to demonstrate the feasibility of reducing value gaps by adapting the dynamics model under different policies. Experiment results show that PCM outperforms the existing SOTA off-policy evaluation methods in the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method.}
}
Endnote
%0 Conference Paper
%T Policy-conditioned Environment Models are More Generalizable
%A Ruifeng Chen
%A Xiong-Hui Chen
%A Yihao Sun
%A Siyuan Xiao
%A Minhui Li
%A Yang Yu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chen24g
%I PMLR
%P 6539--6561
%U https://proceedings.mlr.press/v235/chen24g.html
%V 235
%X In reinforcement learning, it is crucial to have an accurate environment dynamics model to evaluate different policies’ value in downstream tasks like offline policy optimization and policy evaluation. However, the learned model is known to be inaccurate in predictions when evaluating target policies different from data-collection policies. In this work, we found that utilizing policy representation for model learning, called policy-conditioned model (PCM) learning, is useful to mitigate the problem, especially when the offline dataset is collected from diversified behavior policies. The reason beyond that is in this case, PCM becomes a meta-dynamics model that is trained to be aware of and focus on the evaluation policies that on-the-fly adjust the model to be suitable to the evaluation policies’ state-action distribution, thus improving the prediction accuracy. Based on that intuition, we propose an easy-to-implement yet effective algorithm of PCM for accurate model learning. We also give a theoretical analysis and experimental evidence to demonstrate the feasibility of reducing value gaps by adapting the dynamics model under different policies. Experiment results show that PCM outperforms the existing SOTA off-policy evaluation methods in the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method.
APA
Chen, R., Chen, X., Sun, Y., Xiao, S., Li, M. & Yu, Y. (2024). Policy-conditioned Environment Models are More Generalizable. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:6539-6561. Available from https://proceedings.mlr.press/v235/chen24g.html.
