Policy-labeled Preference Learning: Is Preference Enough for RLHF?

Taehyun Cho, Seokhun Ju, Seungyub Han, Dohyeong Kim, Kyungjae Lee, Jungwoo Lee
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10524-10553, 2025.

Abstract

To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves this likelihood mismatch by modeling human preferences with regret, reflecting the efficiency of the executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL’s significant improvements in offline RLHF performance and its effectiveness in online settings.
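For concreteness, the contrast the abstract draws between reward-sum and regret-based preference models can be illustrated as follows. This is a minimal sketch in standard RLHF notation, not the paper's exact formulation: the symbols $r$, $Q^*$, $V^*$, $A^*$ and the segment pair $(\tau^+, \tau^-)$ are illustrative assumptions. Conventional RLHF fits a Bradley–Terry model on cumulative reward,

\[
P(\tau^+ \succ \tau^-) = \sigma\!\Big(\textstyle\sum_{t} r(s_t^+, a_t^+) - \sum_{t} r(s_t^-, a_t^-)\Big),
\]

which implicitly treats both segments as if they were produced by an optimal policy. A regret-based model instead scores a segment by the (negative) regret of its executed actions, e.g.

\[
P(\tau^+ \succ \tau^-) = \sigma\!\Big(\textstyle\sum_{t} A^*(s_t^+, a_t^+) - \sum_{t} A^*(s_t^-, a_t^-)\Big),
\qquad A^*(s,a) = Q^*(s,a) - V^*(s),
\]

so the preference probability reflects how efficiently the generating policy acted rather than the raw return of the states it happened to visit. PPL's actual objective within the DPO framework may differ; this sketch only illustrates the regret-based modeling idea named in the abstract.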

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-cho25b,
  title     = {Policy-labeled Preference Learning: Is Preference Enough for {RLHF}?},
  author    = {Cho, Taehyun and Ju, Seokhun and Han, Seungyub and Kim, Dohyeong and Lee, Kyungjae and Lee, Jungwoo},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {10524--10553},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/cho25b/cho25b.pdf},
  url       = {https://proceedings.mlr.press/v267/cho25b.html},
  abstract  = {To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves this likelihood mismatch by modeling human preferences with regret, reflecting the efficiency of the executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.}
}
Endnote
%0 Conference Paper
%T Policy-labeled Preference Learning: Is Preference Enough for RLHF?
%A Taehyun Cho
%A Seokhun Ju
%A Seungyub Han
%A Dohyeong Kim
%A Kyungjae Lee
%A Jungwoo Lee
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-cho25b
%I PMLR
%P 10524--10553
%U https://proceedings.mlr.press/v267/cho25b.html
%V 267
%X To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves this likelihood mismatch by modeling human preferences with regret, reflecting the efficiency of the executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
APA
Cho, T., Ju, S., Han, S., Kim, D., Lee, K., & Lee, J. (2025). Policy-labeled Preference Learning: Is Preference Enough for RLHF? Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:10524-10553. Available from https://proceedings.mlr.press/v267/cho25b.html.
