Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello, Zhaohan Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:5409-5435, 2024.

Abstract

Ensuring that language models' outputs are aligned with human preferences is critical to guaranteeing a useful, safe, and pleasant user experience. Human alignment has therefore been studied extensively in recent years, and several methods, such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO), and Sequence Likelihood Calibration (SLiC), have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. The equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method that uses a preference model. However, the equivalence can be proven for the online version of IPO, that is, when both generations are sampled from the online policy and annotated by a trained preference model. Optimising the IPO loss on such a stream of data then becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policies), similarly to the general Nash-MD algorithm. We compare online-IPO and IPO-MD to online versions of existing losses on preference data, such as DPO and SLiC, on a summarisation task.
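
To make the two ingredients described above concrete, the short sketch below is an illustrative reading of the abstract, not the authors' implementation: the use of PyTorch, the function names, and the assumption that per-completion log-probabilities have already been summed are all choices made here for clarity. It spells out the IPO regression loss on a preference pair and the token-level geometric mixture of the online and reference policies from which IPO-MD samples its generations; setting the mixing weight beta to zero recovers the purely on-policy sampling of online-IPO.

# Illustrative sketch only (not the paper's code). Assumes PyTorch and that
# sequence log-probabilities are already summed per completion.
import torch

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau):
    # IPO regresses the reference-normalised log-likelihood margin between the
    # preferred (w) and dispreferred (l) completions onto the target 1/(2*tau),
    # where tau is the strength of the KL regularisation.
    margin = (logp_w - logp_l) - (ref_logp_w - ref_logp_l)
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()

def mixture_next_token(online_logits, ref_logits, beta):
    # Token-level geometric mixture pi_theta^(1-beta) * pi_ref^beta, renormalised:
    # a Nash-MD-style sampling distribution of the kind IPO-MD generates data from.
    # beta = 0 falls back to sampling from the online policy (online-IPO).
    mixed_logprobs = ((1.0 - beta) * torch.log_softmax(online_logits, dim=-1)
                      + beta * torch.log_softmax(ref_logits, dim=-1))
    probs = torch.softmax(mixed_logprobs, dim=-1)
    return torch.multinomial(probs, num_samples=1)

In the online setting described in the abstract, two completions per prompt would be drawn with a sampler of this kind and the winner/loser labels supplied by a trained preference model rather than by a fixed offline dataset.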

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-calandriello24a,
  title     = {Human Alignment of Large Language Models through Online Preference Optimisation},
  author    = {Calandriello, Daniele and Guo, Zhaohan Daniel and Munos, Remi and Rowland, Mark and Tang, Yunhao and Avila Pires, Bernardo and Richemond, Pierre Harvey and Le Lan, Charline and Valko, Michal and Liu, Tianqi and Joshi, Rishabh and Zheng, Zeyu and Piot, Bilal},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {5409--5435},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/calandriello24a/calandriello24a.pdf},
  url       = {https://proceedings.mlr.press/v235/calandriello24a.html},
  abstract  = {Ensuring alignment of language model’s outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.}
}
Endnote
%0 Conference Paper
%T Human Alignment of Large Language Models through Online Preference Optimisation
%A Daniele Calandriello
%A Zhaohan Daniel Guo
%A Remi Munos
%A Mark Rowland
%A Yunhao Tang
%A Bernardo Avila Pires
%A Pierre Harvey Richemond
%A Charline Le Lan
%A Michal Valko
%A Tianqi Liu
%A Rishabh Joshi
%A Zeyu Zheng
%A Bilal Piot
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-calandriello24a
%I PMLR
%P 5409--5435
%U https://proceedings.mlr.press/v235/calandriello24a.html
%V 235
%X Ensuring alignment of language model’s outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.
APA
Calandriello, D., Guo, Z.D., Munos, R., Rowland, M., Tang, Y., Avila Pires, B., Richemond, P.H., Le Lan, C., Valko, M., Liu, T., Joshi, R., Zheng, Z. & Piot, B. (2024). Human Alignment of Large Language Models through Online Preference Optimisation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:5409-5435. Available from https://proceedings.mlr.press/v235/calandriello24a.html.
