Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems

Hongyuan Su, Yu Zheng, Yuan Yuan, Yuming Lin, Depeng Jin, Yong Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:57131-57143, 2025.

Abstract

Training reinforcement learning (RL) agents requires extensive trial and error, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized, fast reward models that approximate direct reward evaluation with neural networks. These models dynamically adapt to the agent's evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves a more than 1,000× speedup but also improves performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems: molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo.
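The abstract only sketches the core loop at a high level. Below is a minimal, hypothetical PyTorch illustration of the idea: a small neural surrogate is refit on a sliding window of the latest trajectories so that it approximates the expensive reward only on the subspace the current policy explores, and cheap surrogate queries replace most calls to the costly evaluator. The names (FastRewardModel, expensive_reward, the window size and refit schedule) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
from collections import deque

import torch
import torch.nn as nn


def expensive_reward(state: torch.Tensor) -> float:
    """Stand-in for a costly evaluator (e.g. a simulator); toy placeholder only."""
    return float((state ** 2).sum())


class FastRewardModel(nn.Module):
    """Localized reward surrogate: a small MLP that is cheap to query."""

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)


def refit(model: nn.Module, buffer: deque, epochs: int = 50, lr: float = 1e-3) -> None:
    """Refit the surrogate on the latest (state, true reward) pairs only,
    so it tracks the subspace the current policy is exploring."""
    states = torch.stack([s for s, _ in buffer])
    rewards = torch.tensor([r for _, r in buffer])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(states), rewards).backward()
        opt.step()


state_dim = 8
model = FastRewardModel(state_dim)
buffer = deque(maxlen=2048)            # sliding window of the latest trajectories

for step in range(10_000):
    state = torch.randn(state_dim)     # placeholder for a state visited by the policy
    if step % 10 == 0:                 # query the expensive evaluator only sparsely
        buffer.append((state, expensive_reward(state)))
    if step % 1000 == 0 and len(buffer) >= 64:
        refit(model, buffer)           # adapt the surrogate to the current subspace
    with torch.no_grad():
        reward = model(state.unsqueeze(0)).item()   # cheap reward used for RL updates
```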

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-su25f,
  title     = {Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems},
  author    = {Su, Hongyuan and Zheng, Yu and Yuan, Yuan and Lin, Yuming and Jin, Depeng and Li, Yong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {57131--57143},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/su25f/su25f.pdf},
  url       = {https://proceedings.mlr.press/v267/su25f.html},
  abstract  = {Training reinforcement learning (RL) agents requires extensive trial and error, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized, fast reward models that approximate direct reward evaluation with neural networks. These models dynamically adapt to the agent's evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves a more than 1,000× speedup but also improves performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems: molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo.}
}
Endnote
%0 Conference Paper
%T Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems
%A Hongyuan Su
%A Yu Zheng
%A Yuan Yuan
%A Yuming Lin
%A Depeng Jin
%A Yong Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-su25f
%I PMLR
%P 57131--57143
%U https://proceedings.mlr.press/v267/su25f.html
%V 267
%X Training reinforcement learning (RL) agents requires extensive trial and error, which becomes prohibitively time-consuming in systems with costly reward evaluations. To address this challenge, we propose adaptive reward modeling (AdaReMo), which accelerates RL training by decomposing the complicated reward function into multiple localized, fast reward models that approximate direct reward evaluation with neural networks. These models dynamically adapt to the agent's evolving policy by fitting the currently explored subspace with the latest trajectories, ensuring accurate reward estimation throughout the entire training process while significantly reducing computational overhead. We empirically show that AdaReMo not only achieves a more than 1,000× speedup but also improves performance by 14.6% over state-of-the-art approaches across three expensive-to-evaluate systems: molecular generation, epidemic control, and spatial planning. Code and data for the project are provided at https://github.com/tsinghua-fib-lab/AdaReMo.
APA
Su, H., Zheng, Y., Yuan, Y., Lin, Y., Jin, D., & Li, Y. (2025). Reinforcement Learning with Adaptive Reward Modeling for Expensive-to-Evaluate Systems. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:57131-57143. Available from https://proceedings.mlr.press/v267/su25f.html.
