AMPO: Active Multi Preference Optimization for Self-play Preference Selection

Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:21342-21368, 2025.

Abstract

Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset—covering reward extremes and distinct semantic clusters—for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization using our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B and Mistral 7B. We release our datasets here.
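
To make the active selection step concrete, the sketch below shows one plausible way to pick a subset that covers reward extremes and distinct semantic clusters. The function name, the k-means clustering heuristic, the cluster-representative rule, and all parameters are illustrative assumptions, not the paper's exact procedure (which comes with its own theoretical guarantees).

import numpy as np
from sklearn.cluster import KMeans

def select_active_subset(rewards, embeddings, k_clusters=4, subset_size=6):
    """Illustrative selection of a small, informative candidate subset:
    the reward extremes plus one representative per semantic cluster.

    rewards:    (N,) array of reward-model scores, one per candidate response.
    embeddings: (N, d) array of sentence embeddings for the candidates.
    Returns sorted indices into the candidate pool.
    """
    n = len(rewards)
    # Always keep the best- and worst-scored responses (reward extremes).
    chosen = {int(np.argmax(rewards)), int(np.argmin(rewards))}

    # Cluster the embedding space to expose distinct semantic modes.
    k = min(k_clusters, n)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)

    # From each cluster, take the candidate closest to that cluster's mean reward,
    # so mid-reward, underexplored modes also enter the contrastive set.
    for c in range(k):
        idx = np.where(labels == c)[0]
        rep = idx[np.argmin(np.abs(rewards[idx] - rewards[idx].mean()))]
        chosen.add(int(rep))
        if len(chosen) >= subset_size:
            break

    return sorted(chosen)

# Toy usage: 16 on-policy candidates with random scores and 8-d embeddings.
rng = np.random.default_rng(0)
print(select_active_subset(rng.normal(size=16), rng.normal(size=(16, 8))))

The selected indices would then feed the group-contrastive preference objective in place of the full candidate pool.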

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gupta25c,
  title     = {{AMPO}: Active Multi Preference Optimization for Self-play Preference Selection},
  author    = {Gupta, Taneesh and Madhavan, Rahul and Zhang, Xuchao and Bansal, Chetan and Rajmohan, Saravan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {21342--21368},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gupta25c/gupta25c.pdf},
  url       = {https://proceedings.mlr.press/v267/gupta25c.html},
  abstract  = {Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset—covering reward extremes and distinct semantic clusters—for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization using our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B and Mistral 7B. We release our datasets here.}
}
Endnote
%0 Conference Paper
%T AMPO: Active Multi Preference Optimization for Self-play Preference Selection
%A Taneesh Gupta
%A Rahul Madhavan
%A Xuchao Zhang
%A Chetan Bansal
%A Saravan Rajmohan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gupta25c
%I PMLR
%P 21342--21368
%U https://proceedings.mlr.press/v267/gupta25c.html
%V 267
%X Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, making it computationally infeasible to include all of them in the training objective. We propose Active Multi-Preference Optimization (AMPO), which combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses, then pick a small but informative subset—covering reward extremes and distinct semantic clusters—for preference optimization. The resulting contrastive-training scheme identifies not only the best and worst answers but also subtle, underexplored modes crucial for robust alignment. Theoretically, we provide guarantees of expected reward maximization using our active selection method. Empirically, AMPO achieves state-of-the-art results on AlpacaEval with Llama 8B and Mistral 7B. We release our datasets here.
APA
Gupta, T., Madhavan, R., Zhang, X., Bansal, C. & Rajmohan, S. (2025). AMPO: Active Multi Preference Optimization for Self-play Preference Selection. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:21342-21368. Available from https://proceedings.mlr.press/v267/gupta25c.html.