Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning

Yihao Sun, Jiaji Zhang, Chengxing Jia, Haoxin Lin, Junyin Ye, Yang Yu
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:33177-33194, 2023.

Abstract

For offline reinforcement learning (RL), model-based methods are expected to be data-efficient because they incorporate dynamics models to generate additional data. However, due to inevitable model errors, straightforwardly learning a policy in the learned model typically fails in the offline setting. Previous studies have therefore incorporated conservatism to prevent out-of-distribution exploration. For example, MOPO penalizes rewards through uncertainty measures derived from predicting the next states, which we show are loose bounds on the ideal uncertainty, i.e., the Bellman error. In this work, we propose MOdel-Bellman Inconsistency penalized offLinE Policy Optimization (MOBILE), a novel uncertainty-driven offline RL algorithm. MOBILE quantifies uncertainty through the inconsistency of Bellman estimations under an ensemble of learned dynamics models, which can be a better approximation of the true Bellman error, and penalizes the Bellman estimation based on this uncertainty. Empirically, we verify that our proposed uncertainty quantification is significantly closer to the true Bellman error than that of the compared methods. Consequently, MOBILE outperforms prior offline RL approaches on most tasks of the D4RL and NeoRL benchmarks.
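The abstract describes penalizing Bellman estimations by their inconsistency across an ensemble of learned dynamics models. The listing below is a minimal PyTorch sketch of that idea, not the paper's exact formulation: the use of the ensemble standard deviation as the inconsistency measure, the penalty coefficient beta, and all function and tensor names (penalized_bellman_target, ensemble_rewards, ensemble_next_obs) are illustrative assumptions.

# Minimal sketch of the Model-Bellman inconsistency penalty described above:
# Bellman targets are computed under each member of a learned dynamics-model
# ensemble, their disagreement is taken as the uncertainty, and the target is
# penalized by it. The standard-deviation measure and the hyperparameter
# `beta` are assumed choices, not the paper's exact definitions.
import torch


def penalized_bellman_target(ensemble_rewards, ensemble_next_obs,
                             q_target, policy, gamma=0.99, beta=1.0):
    """Compute a conservative Bellman target from ensemble predictions.

    ensemble_rewards:  (N, B) rewards predicted by N dynamics models
    ensemble_next_obs: (N, B, obs_dim) next states predicted by N models
    q_target:          callable (obs, act) -> (B, 1) target Q-values
    policy:            callable obs -> (B, act_dim) actions
    """
    targets = []
    with torch.no_grad():
        for rewards, next_obs in zip(ensemble_rewards, ensemble_next_obs):
            next_act = policy(next_obs)
            next_q = q_target(next_obs, next_act).squeeze(-1)   # (B,)
            targets.append(rewards + gamma * next_q)            # Bellman estimation under model i
    targets = torch.stack(targets, dim=0)                       # (N, B)
    inconsistency = targets.std(dim=0)                          # disagreement across the ensemble
    return targets.mean(dim=0) - beta * inconsistency           # penalized Bellman target


if __name__ == "__main__":
    torch.manual_seed(0)
    n_models, batch, obs_dim, act_dim = 5, 4, 3, 2
    rewards = torch.randn(n_models, batch)
    next_obs = torch.randn(n_models, batch, obs_dim)
    # Stand-in policy and target Q-network, purely for demonstration.
    policy = lambda s: torch.tanh(s[..., :act_dim])
    q_target = lambda s, a: s.sum(-1, keepdim=True) + a.sum(-1, keepdim=True)
    y = penalized_bellman_target(rewards, next_obs, q_target, policy, beta=2.0)
    print(y.shape)  # torch.Size([4])

In a full actor-critic loop, a target of this form would stand in for the standard Bellman target when fitting the critic on model-generated transitions.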

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-sun23q,
  title     = {Model-{B}ellman Inconsistency for Model-based Offline Reinforcement Learning},
  author    = {Sun, Yihao and Zhang, Jiaji and Jia, Chengxing and Lin, Haoxin and Ye, Junyin and Yu, Yang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {33177--33194},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/sun23q/sun23q.pdf},
  url       = {https://proceedings.mlr.press/v202/sun23q.html}
}
Endnote
%0 Conference Paper
%T Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning
%A Yihao Sun
%A Jiaji Zhang
%A Chengxing Jia
%A Haoxin Lin
%A Junyin Ye
%A Yang Yu
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-sun23q
%I PMLR
%P 33177--33194
%U https://proceedings.mlr.press/v202/sun23q.html
%V 202
APA
Sun, Y., Zhang, J., Jia, C., Lin, H., Ye, J., & Yu, Y. (2023). Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:33177-33194. Available from https://proceedings.mlr.press/v202/sun23q.html.
