Policy-labeled Preference Learning: Is Preference Enough for RLHF?
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:10524-10553, 2025.
Abstract
To design rewards that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models with reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves this likelihood mismatch by modeling human preferences with regret, reflecting the efficiency of the executed policy. We further introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL’s significant improvements in offline RLHF performance and its effectiveness in online settings.
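As a rough illustration of the regret-based preference modeling the abstract describes (not the paper's actual PPL objective), the sketch below shows a Bradley-Terry-style preference loss in which the trajectory segment with lower estimated regret is treated as preferred. The function name, tensor shapes, and the assumption that per-segment regrets have already been estimated are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def regret_preference_loss(regret_preferred: torch.Tensor,
                           regret_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style preference loss over estimated regrets.

    The segment with lower regret (i.e., closer to optimal behavior) should
    be preferred, so the preference logit is the regret gap
    regret_rejected - regret_preferred. Both inputs have shape (batch,).
    """
    logits = regret_rejected - regret_preferred
    # Negative log-likelihood of the observed preference under a
    # sigmoid (Bradley-Terry) preference model.
    return -F.logsigmoid(logits).mean()

# Toy usage: preferred segments carry lower estimated regret on average.
if __name__ == "__main__":
    r_pref = torch.tensor([0.1, 0.3, 0.2])
    r_rej = torch.tensor([0.9, 0.5, 0.4])
    print(regret_preference_loss(r_pref, r_rej))
```

Contrasting this with standard reward-based RLHF, the preference probability here depends on how efficiently each segment was executed rather than on summed rewards alone, which is the likelihood-mismatch issue the abstract points to.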