Don’t Be Pessimistic Too Early: Look K Steps Ahead!
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3313-3321, 2024.
Abstract
Offline reinforcement learning (RL) aims to train highly rewarding policies exclusively from existing data, and has shown great real-world impact. Pessimism, \emph{i.e.}, avoiding uncertain states or actions during decision making, has long been the main theme of offline RL. However, existing works often lead to overly conservative policies with rather sub-optimal performance. To tackle this challenge, we introduce the notion of \emph{lookahead pessimism} within the model-based offline RL paradigm. Intuitively, while the classical pessimism principle requires termination whenever the RL agent reaches an uncertain region, our method allows the agent to use a lookahead set carefully crafted from the learned model and to make a move based on the properties of that set. Remarkably, we show that this enables learning a less conservative policy with a better performance guarantee. We refer to our method as Lookahead Pessimistic MDP (LP-MDP). Theoretically, we provide a rigorous analysis of the performance lower bound, which monotonically improves with the number of lookahead steps. Empirically, with the easy-to-implement design of LP-MDP, we demonstrate solid performance improvements over baseline methods on widely used offline RL benchmarks.
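To make the contrast between classical pessimism and lookahead pessimism concrete, the following is a minimal illustrative sketch, not the paper's actual LP-MDP construction: it assumes a hypothetical learned model with a `model.step(state, action)` interface returning `(next_state, reward)`, a hypothetical `uncertainty_fn` (e.g., ensemble disagreement), and a fixed penalty on termination. Classical pessimism terminates as soon as an uncertain state-action pair is reached, whereas the lookahead variant rolls the learned model forward up to `k` steps and only terminates if the entire lookahead set remains uncertain.

```python
def is_uncertain(state, action, uncertainty_fn, threshold):
    """Flag a transition as uncertain when the model's uncertainty
    estimate (e.g., ensemble disagreement) exceeds a threshold."""
    return uncertainty_fn(state, action) > threshold


def classical_pessimistic_step(state, action, model, uncertainty_fn, threshold):
    """Classical pessimism: terminate the rollout with a penalty as soon
    as an uncertain state-action pair is reached."""
    if is_uncertain(state, action, uncertainty_fn, threshold):
        return None, -1.0, True  # terminal, penalized
    next_state, reward = model.step(state, action)
    return next_state, reward, False


def lookahead_pessimistic_step(state, action, policy, model,
                               uncertainty_fn, threshold, k):
    """Lookahead pessimism (illustrative sketch): before terminating at an
    uncertain pair, roll the learned model forward up to k steps and only
    terminate if every pair in the lookahead set is also uncertain."""
    if not is_uncertain(state, action, uncertainty_fn, threshold):
        next_state, reward = model.step(state, action)
        return next_state, reward, False

    # Build the k-step lookahead set from the learned model.
    s, a = state, action
    for _ in range(k):
        s, _ = model.step(s, a)
        a = policy(s)
        if not is_uncertain(s, a, uncertainty_fn, threshold):
            # The lookahead re-enters a certain region: keep moving
            # instead of terminating early.
            next_state, reward = model.step(state, action)
            return next_state, reward, False

    # Every pair in the lookahead set is uncertain: fall back to pessimism.
    return None, -1.0, True
```

The sketch is only meant to convey the intuition from the abstract: with `k = 0` it reduces to the classical pessimistic termination rule, while larger `k` lets the agent pass through uncertain regions that the learned model predicts will quickly return to well-covered states, which is why the resulting policy can be less conservative.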