Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang, Bingcong Li, Christoph Dann, Niao He
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:25438-25473, 2025.

Abstract

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: a policy’s coverability of the optimal policy is captured by its sub-optimality. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm—Transfer Policy Optimization (TPO)—with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
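For context, the KL-regularized RLHF objective the abstract refers to is typically written as follows. This is a paraphrase in standard notation for orientation only; the symbols (reference policy \pi_{\mathrm{ref}}, ground-truth reward r^*, regularization strength \beta, prompt distribution \rho) are assumptions and are not taken verbatim from the paper:

    % KL-regularized RLHF objective (standard form; notation assumed)
    J(\pi) \;=\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot\mid x)}\!\left[ r^*(x, y) \right]
    \;-\; \beta\, \mathbb{E}_{x \sim \rho}\!\left[ \mathrm{KL}\!\left( \pi(\cdot\mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \right) \right].

Read informally, the coverage property highlighted in the abstract says that the smaller a policy's sub-optimality gap J(\pi^*) - J(\pi) under this objective, the better that policy covers the optimal policy \pi^*; this is an informal restatement of the abstract's claim, not the paper's formal theorem.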

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-huang25o,
  title     = {Can {RLHF} be More Efficient with Imperfect Reward Models? {A} Policy Coverage Perspective},
  author    = {Huang, Jiawei and Li, Bingcong and Dann, Christoph and He, Niao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {25438--25473},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/huang25o/huang25o.pdf},
  url       = {https://proceedings.mlr.press/v267/huang25o.html}
}
Endnote
%0 Conference Paper
%T Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
%A Jiawei Huang
%A Bingcong Li
%A Christoph Dann
%A Niao He
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-huang25o
%I PMLR
%P 25438--25473
%U https://proceedings.mlr.press/v267/huang25o.html
%V 267
APA
Huang, J., Li, B., Dann, C., & He, N. (2025). Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:25438-25473. Available from https://proceedings.mlr.press/v267/huang25o.html.