MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Dinesh Manocha, Furong Huang, Amrit Bedi, Mengdi Wang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:6116-6135, 2024.

Abstract

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result for alignment with single-reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and to solve a MaxMin alignment objective, inspired by the Egalitarian principle in social choice theory, to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings are not limited to language models but extend to reinforcement learning in general.
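As an illustrative sketch (the notation below is ours, reconstructed from the abstract rather than quoted from the paper): given reward models r_1, ..., r_H fitted via expectation-maximization to H user sub-populations, a prompt distribution \rho, a reference policy \pi_{\mathrm{ref}}, and a KL coefficient \beta as in standard KL-regularized RLHF, the MaxMin alignment objective maximizes the worst-case (Egalitarian) utility across sub-populations:

\max_{\pi}\ \min_{h \in \{1,\dots,H\}}\ \mathbb{E}_{x \sim \rho}\Big[\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r_h(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big].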

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-chakraborty24b,
  title     = {{M}ax{M}in-{RLHF}: Alignment with Diverse Human Preferences},
  author    = {Chakraborty, Souradip and Qiu, Jiahao and Yuan, Hui and Koppel, Alec and Manocha, Dinesh and Huang, Furong and Bedi, Amrit and Wang, Mengdi},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {6116--6135},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/chakraborty24b/chakraborty24b.pdf},
  url       = {https://proceedings.mlr.press/v235/chakraborty24b.html},
  abstract  = {Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result for alignment with single-reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and to solve a MaxMin alignment objective, inspired by the Egalitarian principle in social choice theory, to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings are not limited to language models but extend to reinforcement learning in general.}
}
Endnote
%0 Conference Paper
%T MaxMin-RLHF: Alignment with Diverse Human Preferences
%A Souradip Chakraborty
%A Jiahao Qiu
%A Hui Yuan
%A Alec Koppel
%A Dinesh Manocha
%A Furong Huang
%A Amrit Bedi
%A Mengdi Wang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-chakraborty24b
%I PMLR
%P 6116--6135
%U https://proceedings.mlr.press/v235/chakraborty24b.html
%V 235
%X Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a single reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result for alignment with single-reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and to solve a MaxMin alignment objective, inspired by the Egalitarian principle in social choice theory, to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale (Tulu2-7B) language models and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings are not limited to language models but extend to reinforcement learning in general.
APA
Chakraborty, S., Qiu, J., Yuan, H., Koppel, A., Manocha, D., Huang, F., Bedi, A. & Wang, M. (2024). MaxMin-RLHF: Alignment with Diverse Human Preferences. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:6116-6135. Available from https://proceedings.mlr.press/v235/chakraborty24b.html.