Don’t Be Pessimistic Too Early: Look K Steps Ahead!

Chaoqi Wang, Ziyu Ye, Kevin Murphy, Yuxin Chen
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3313-3321, 2024.

Abstract

Offline reinforcement learning (RL) considers to train highly rewarding policies exclusively from existing data, showing great real-world impacts. Pessimism, \emph{i.e.}, avoiding uncertain states or actions during decision making, has long been the main theme for offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of \emph{lookahead pessimism} within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks to terminate whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move by properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis on the performance lower bound, which monotonically improves with the lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-wang24h, title = {Don’t Be Pessimistic Too Early: Look {K} Steps Ahead!}, author = {Wang, Chaoqi and Ye, Ziyu and Murphy, Kevin and Chen, Yuxin}, booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics}, pages = {3313--3321}, year = {2024}, editor = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen}, volume = {238}, series = {Proceedings of Machine Learning Research}, month = {02--04 May}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v238/wang24h/wang24h.pdf}, url = {https://proceedings.mlr.press/v238/wang24h.html}, abstract = {Offline reinforcement learning (RL) considers to train highly rewarding policies exclusively from existing data, showing great real-world impacts. Pessimism, \emph{i.e.}, avoiding uncertain states or actions during decision making, has long been the main theme for offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of \emph{lookahead pessimism} within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks to terminate whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move by properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis on the performance lower bound, which monotonically improves with the lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.} }
Endnote
%0 Conference Paper %T Don’t Be Pessimistic Too Early: Look K Steps Ahead! %A Chaoqi Wang %A Ziyu Ye %A Kevin Murphy %A Yuxin Chen %B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics %C Proceedings of Machine Learning Research %D 2024 %E Sanjoy Dasgupta %E Stephan Mandt %E Yingzhen Li %F pmlr-v238-wang24h %I PMLR %P 3313--3321 %U https://proceedings.mlr.press/v238/wang24h.html %V 238 %X Offline reinforcement learning (RL) considers to train highly rewarding policies exclusively from existing data, showing great real-world impacts. Pessimism, \emph{i.e.}, avoiding uncertain states or actions during decision making, has long been the main theme for offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of \emph{lookahead pessimism} within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle asks to terminate whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model, and to make a move by properties of the lookahead set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis on the performance lower bound, which monotonically improves with the lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate a solid performance improvement over baseline methods on widely used offline RL benchmarks.
APA
Wang, C., Ye, Z., Murphy, K. & Chen, Y.. (2024). Don’t Be Pessimistic Too Early: Look K Steps Ahead!. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3313-3321 Available from https://proceedings.mlr.press/v238/wang24h.html.

Related Material