Common Learning Constraints Alter Interpretations of Direct Preference Optimization

Lemin Kong, Xiangkun Hu, Tong He, David Wipf
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:2215-2223, 2025.

Abstract

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational role of DPO reparameterizations (and equivalency to applying RLHF with an optimal reward) may be obfuscated once inevitable optimization constraints are introduced during model training. This then motivates alternative derivations and analysis of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to compressive sensing and classical constrained optimization problems involving noise-adaptive, concave regularization.
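For reference, the single closed-form training objective mentioned above is the standard DPO loss from the original DPO derivation (Rafailov et al., 2023); it is included here for context and is not reproduced from this paper's text:

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],
\]

where \(\sigma\) is the logistic function, \(\beta > 0\) weights the KL regularization toward the reference policy \(\pi_{\mathrm{ref}}\), and \((x, y_w, y_l)\) is a prompt paired with preferred and dispreferred responses. The reparameterization at issue is the implicit reward \(r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x)\), whose equivalence to RLHF with an optimal reward holds for unconstrained optimization; the paper examines how this interpretation is affected once optimization constraints on \(\pi_\theta\) are introduced during training.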

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-kong25a,
  title     = {Common Learning Constraints Alter Interpretations of Direct Preference Optimization},
  author    = {Kong, Lemin and Hu, Xiangkun and He, Tong and Wipf, David},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages     = {2215--2223},
  year      = {2025},
  editor    = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume    = {258},
  series    = {Proceedings of Machine Learning Research},
  month     = {03--05 May},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/kong25a/kong25a.pdf},
  url       = {https://proceedings.mlr.press/v258/kong25a.html},
  abstract  = {Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational role of DPO reparameterizations (and equivalency to applying RLHF with an optimal reward) may be obfuscated once inevitable optimization constraints are introduced during model training. This then motivates alternative derivations and analysis of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to compressive sensing and classical constrained optimization problems involving noise-adaptive, concave regularization.}
}
Endnote
%0 Conference Paper
%T Common Learning Constraints Alter Interpretations of Direct Preference Optimization
%A Lemin Kong
%A Xiangkun Hu
%A Tong He
%A David Wipf
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-kong25a
%I PMLR
%P 2215--2223
%U https://proceedings.mlr.press/v258/kong25a.html
%V 258
%X Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational role of DPO reparameterizations (and equivalency to applying RLHF with an optimal reward) may be obfuscated once inevitable optimization constraints are introduced during model training. This then motivates alternative derivations and analysis of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to compressive sensing and classical constrained optimization problems involving noise-adaptive, concave regularization.
APA
Kong, L., Hu, X., He, T. & Wipf, D. (2025). Common Learning Constraints Alter Interpretations of Direct Preference Optimization. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:2215-2223. Available from https://proceedings.mlr.press/v258/kong25a.html.