Generalized Preference Optimization: A Unified Approach to Offline Alignment

Yunhao Tang; Zhaohan Daniel Guo; Zeyu Zheng; Daniele Calandriello; Remi Munos; Mark Rowland; Pierre Harvey Richemond; Michal Valko; Bernardo Avila Pires; Bilal Piot

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:47725-47742, 2024.

Abstract

Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al. (2023), we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal hyperparameter values might differ, as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
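
As a rough illustration of the "family of losses parameterized by a convex function" described in the abstract, the sketch below evaluates a per-example loss f(beta * rho), where rho is the difference of policy-to-reference log-ratios between the preferred and the dispreferred response. The abstract does not spell out this form: the function names, the argument layout, and the particular scaling of the logistic (DPO), squared (IPO) and hinge (SLiC) choices are assumptions made for illustration, not the authors' reference implementation.

```python
import math


def logistic(x):
    """Logistic loss -log(sigmoid(x)); assumed here to correspond to DPO."""
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def squared(x):
    """Squared loss (x - 1)^2; assumed here to correspond to IPO up to scaling."""
    return (x - 1.0) ** 2


def hinge(x):
    """Hinge loss max(0, 1 - x); assumed here to correspond to SLiC."""
    return max(0.0, 1.0 - x)


def gpo_loss(f, beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Per-example loss f(beta * rho) for a convex function f.

    logp_w / logp_l are hypothetical policy log-probabilities of the
    preferred and dispreferred responses; ref_logp_* are the same
    quantities under the reference policy.
    """
    rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return f(beta * rho)


# Example: the same preference pair scored under three choices of f.
for name, f in [("logistic", logistic), ("squared", squared), ("hinge", hinge)]:
    print(name, gpo_loss(f, 0.5, -1.0, -3.0, -2.0, -2.0))
```

Swapping f changes only how strongly large margins are penalized or rewarded, which is one way to read the abstract's remark that the convex function determines how regularization is enforced.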

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-tang24b,
  title     = {Generalized Preference Optimization: A Unified Approach to Offline Alignment},
  author    = {Tang, Yunhao and Guo, Zhaohan Daniel and Zheng, Zeyu and Calandriello, Daniele and Munos, Remi and Rowland, Mark and Richemond, Pierre Harvey and Valko, Michal and Avila Pires, Bernardo and Piot, Bilal},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {47725--47742},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/tang24b/tang24b.pdf},
  url       = {https://proceedings.mlr.press/v235/tang24b.html},
  abstract  = {Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al 2023, we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal values of hyper-parameter might differ as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.}
}

Endnote

%0 Conference Paper
%T Generalized Preference Optimization: A Unified Approach to Offline Alignment
%A Yunhao Tang
%A Zhaohan Daniel Guo
%A Zeyu Zheng
%A Daniele Calandriello
%A Remi Munos
%A Mark Rowland
%A Pierre Harvey Richemond
%A Michal Valko
%A Bernardo Avila Pires
%A Bilal Piot
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-tang24b
%I PMLR
%P 47725--47742
%U https://proceedings.mlr.press/v235/tang24b.html
%V 235
%X Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al 2023, we also show that different GPO variants achieve similar trade-offs between regularization and performance, though the optimal values of hyper-parameter might differ as predicted by theory. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.

APA


Tang, Y., Guo, Z.D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P.H., Valko, M., Avila Pires, B. & Piot, B. (2024). Generalized Preference Optimization: A Unified Approach to Offline Alignment. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:47725-47742. Available from https://proceedings.mlr.press/v235/tang24b.html
