Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL

Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:41450-41467, 2025.

Abstract

Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
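To make the abstract's core idea concrete, the following is a minimal sketch, not the authors' implementation, of how selective state-adaptive regularization could be attached to a TD3+BC-style policy-constraint actor loss. The per-state coefficients alpha_s, the high-quality-action mask, and all function names are illustrative assumptions; the paper's actual formulation, including its extension to CQL-style value regularization, is in the linked repository.

# Illustrative sketch only (assumed variable names and weighting scheme),
# loosely following the idea described in the abstract.
import torch

def selective_state_adaptive_actor_loss(policy, q_net, states, dataset_actions,
                                         alpha_s, mask):
    """
    states:          (B, state_dim)  batch of states from the offline dataset
    dataset_actions: (B, action_dim) behavior actions for those states
    alpha_s:         (B,)            per-state regularization coefficients
                                     (larger -> trust the data more than Bellman updates)
    mask:            (B,)            1 for "high-quality" actions to constrain toward,
                                     0 for actions left unconstrained
    """
    pi_actions = policy(states)

    # RL term: follow the (Bellman-updated) Q-function.
    q_term = -q_net(states, pi_actions).mean()

    # Selective, state-adaptive behavior-cloning term: only high-quality
    # actions contribute, each weighted by its state's coefficient.
    bc_per_state = ((pi_actions - dataset_actions) ** 2).sum(dim=-1)
    bc_term = (alpha_s * mask * bc_per_state).mean()

    return q_term + bc_term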

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-luo25p,
  title     = {Learning to Trust {B}ellman Updates: Selective State-Adaptive Regularization for Offline {RL}},
  author    = {Luo, Qin-Wen and Xie, Ming-Kun and Wang, Ye-Wen and Huang, Sheng-Jun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {41450--41467},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/luo25p/luo25p.pdf},
  url       = {https://proceedings.mlr.press/v267/luo25p.html},
  abstract  = {Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.}
}
Endnote
%0 Conference Paper
%T Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL
%A Qin-Wen Luo
%A Ming-Kun Xie
%A Ye-Wen Wang
%A Sheng-Jun Huang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-luo25p
%I PMLR
%P 41450--41467
%U https://proceedings.mlr.press/v267/luo25p.html
%V 267
%X Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset. To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states. However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates. To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions. By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches. Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
APA
Luo, Q., Xie, M., Wang, Y., & Huang, S. (2025). Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:41450-41467. Available from https://proceedings.mlr.press/v267/luo25p.html.