Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

Jeonghye Kim, Yongjae Shin, Whiyoung Jung, Sunghoon Hong, Deunsol Yoon, Youngchul Sung, Kanghoon Lee, Woohyung Lim
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:30769-30790, 2025.

Abstract

Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
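The abstract's two ingredients — reward scaling with layer normalization (RS-LN) and penalizing infeasible actions (PA) — can be illustrated with a minimal sketch. This is not the paper's implementation; the scaling constant, the penalty value, and the function names are hypothetical, and layer normalization is shown standalone rather than inside a full critic network:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standard layer normalization: zero mean, unit variance per feature
    # vector. In RS-LN this bounds activation magnitudes in the critic,
    # curbing linear extrapolation of Q-values beyond the data range.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def td_target(reward, next_q, scale=10.0, gamma=0.99):
    # Reward scaling: multiply rewards by a constant (scale is a
    # hypothetical choice here) before the Bellman backup, so in-data
    # Q-values sit well above the normalized network's off-data outputs.
    return scale * reward + gamma * next_q

def pa_targets(q_min, n):
    # Penalizing infeasible actions: actions outside the feasible range
    # are regressed toward a constant low value q_min, guiding Q-values
    # to decrease gradually outside the data support.
    return np.full(n, q_min)
```

Together these nudge the learned Q-function downward outside the data range instead of letting it extrapolate linearly upward, which the paper identifies as the problematic case.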

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-kim25ai,
  title     = {Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data},
  author    = {Kim, Jeonghye and Shin, Yongjae and Jung, Whiyoung and Hong, Sunghoon and Yoon, Deunsol and Sung, Youngchul and Lee, Kanghoon and Lim, Woohyung},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {30769--30790},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/kim25ai/kim25ai.pdf},
  url       = {https://proceedings.mlr.press/v267/kim25ai.html},
  abstract  = {Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.}
}
Endnote
%0 Conference Paper
%T Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data
%A Jeonghye Kim
%A Yongjae Shin
%A Whiyoung Jung
%A Sunghoon Hong
%A Deunsol Yoon
%A Youngchul Sung
%A Kanghoon Lee
%A Woohyung Lim
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-kim25ai
%I PMLR
%P 30769--30790
%U https://proceedings.mlr.press/v267/kim25ai.html
%V 267
%X Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
APA
Kim, J., Shin, Y., Jung, W., Hong, S., Yoon, D., Sung, Y., Lee, K. & Lim, W. (2025). Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:30769-30790. Available from https://proceedings.mlr.press/v267/kim25ai.html.