Comparing Few to Rank Many: Active Human Preference Learning Using Randomized Frank-Wolfe Method

Kiran Koshy Thekumparampil, Gaurush Hiranandani, Kousha Kalantari, Shoham Sabach, Branislav Kveton
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:59355-59376, 2025.

Abstract

We study learning human preferences from limited comparison feedback, a core machine learning problem that is at the center of reinforcement learning from human feedback (RLHF). We formulate the problem as learning a Plackett-Luce (PL) model from a limited number of $K$-subset comparisons over a universe of $N$ items, where typically $K \ll N$. Our objective is to select the $K$-subsets such that all items can be ranked with minimal mistakes within the budget. We solve the problem using the D-optimal design, which minimizes the worst-case ranking loss under the estimated PL model. All known algorithms for this problem are computationally infeasible in our setting because we consider exponentially many subsets in $K$. To address this challenge, we propose a randomized Frank-Wolfe algorithm with memoization and sparse updates that has a low $O(N^2 + K^2)$ per-iteration complexity. We analyze it and demonstrate its empirical superiority on synthetic and open-source NLP datasets.
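To make the approach concrete, below is a minimal, hypothetical Python sketch of the kind of randomized Frank-Wolfe loop the abstract describes: a D-optimal design over $K$-subsets is optimized by scoring only a random batch of candidate subsets per iteration instead of enumerating all $\binom{N}{K}$ of them. The subset information matrix uses a simplified pairwise (Bradley-Terry-style) approximation under a uniform preference model, and a full matrix re-inversion stands in for the paper's memoized sparse updates; all names and parameters are illustrative assumptions, not the authors' implementation.

import numpy as np

def subset_information(S, N):
    # Simplified information matrix of one K-subset S: sum of
    # 0.25 * (e_i - e_j)(e_i - e_j)^T over pairs in S, i.e. the Fisher
    # information of pairwise comparisons under a uniform (p = 1/2) model.
    # This is an illustrative stand-in for the paper's PL information matrix.
    A = np.zeros((N, N))
    for a in range(len(S)):
        for b in range(a + 1, len(S)):
            v = np.zeros(N)
            v[S[a]], v[S[b]] = 1.0, -1.0
            A += 0.25 * np.outer(v, v)
    return A

def randomized_frank_wolfe_design(N, K, iters=200, candidates=50, ridge=1e-3, seed=0):
    # Hypothetical randomized Frank-Wolfe loop for a D-optimal design over
    # K-subsets: score a random batch of candidates by the linearization
    # tr(V^{-1} A_S) of log det, move mass toward the best one, and update
    # the design matrix as a convex combination.
    rng = np.random.default_rng(seed)
    design = {}                      # sparse design: subset (tuple) -> weight
    V = ridge * np.eye(N)            # regularized design (information) matrix
    V_inv = np.linalg.inv(V)

    for t in range(iters):
        # Randomized linear-maximization step over a sampled batch of subsets.
        best_S, best_score, best_A = None, -np.inf, None
        for _ in range(candidates):
            S = tuple(sorted(rng.choice(N, size=K, replace=False)))
            A_S = subset_information(S, N)
            score = np.trace(V_inv @ A_S)   # directional derivative of log det
            if score > best_score:
                best_S, best_score, best_A = S, score, A_S

        # Frank-Wolfe convex-combination update with the classical 2/(t+2) step.
        gamma = 2.0 / (t + 2.0)
        for S in list(design):
            design[S] *= (1.0 - gamma)
        design[best_S] = design.get(best_S, 0.0) + gamma

        # Update V and its inverse; a full O(N^3) re-inversion is used here for
        # simplicity, whereas the paper's memoized sparse updates avoid it.
        V = (1.0 - gamma) * V + gamma * (best_A + ridge * np.eye(N))
        V_inv = np.linalg.inv(V)

    return design

if __name__ == "__main__":
    w = randomized_frank_wolfe_design(N=20, K=4)
    top = sorted(w.items(), key=lambda kv: -kv[1])[:5]
    print("Most-weighted K-subsets to query:", top)

The returned weights can then be interpreted as a sampling distribution over which $K$-subsets to show to annotators; the sketch omits the subsequent PL-model fitting and ranking step.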

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-thekumparampil25a,
  title     = {Comparing Few to Rank Many: Active Human Preference Learning Using Randomized Frank-{W}olfe Method},
  author    = {Thekumparampil, Kiran Koshy and Hiranandani, Gaurush and Kalantari, Kousha and Sabach, Shoham and Kveton, Branislav},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {59355--59376},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/thekumparampil25a/thekumparampil25a.pdf},
  url       = {https://proceedings.mlr.press/v267/thekumparampil25a.html},
  abstract  = {We study learning human preferences from limited comparison feedback, a core machine learning problem that is at the center of reinforcement learning from human feedback (RLHF). We formulate the problem as learning a Plackett-Luce (PL) model from a limited number of $K$-subset comparisons over a universe of $N$ items, where typically $K \ll N$. Our objective is to select the $K$-subsets such that all items can be ranked with minimal mistakes within the budget. We solve the problem using the D-optimal design, which minimizes the worst-case ranking loss under the estimated PL model. All known algorithms for this problem are computationally infeasible in our setting because we consider exponentially many subsets in $K$. To address this challenge, we propose a randomized Frank-Wolfe algorithm with memoization and sparse updates that has a low $O(N^2 + K^2)$ per-iteration complexity. We analyze it and demonstrate its empirical superiority on synthetic and open-source NLP datasets.}
}
Endnote
%0 Conference Paper
%T Comparing Few to Rank Many: Active Human Preference Learning Using Randomized Frank-Wolfe Method
%A Kiran Koshy Thekumparampil
%A Gaurush Hiranandani
%A Kousha Kalantari
%A Shoham Sabach
%A Branislav Kveton
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-thekumparampil25a
%I PMLR
%P 59355--59376
%U https://proceedings.mlr.press/v267/thekumparampil25a.html
%V 267
%X We study learning human preferences from limited comparison feedback, a core machine learning problem that is at the center of reinforcement learning from human feedback (RLHF). We formulate the problem as learning a Plackett-Luce (PL) model from a limited number of $K$-subset comparisons over a universe of $N$ items, where typically $K \ll N$. Our objective is to select the $K$-subsets such that all items can be ranked with minimal mistakes within the budget. We solve the problem using the D-optimal design, which minimizes the worst-case ranking loss under the estimated PL model. All known algorithms for this problem are computationally infeasible in our setting because we consider exponentially many subsets in $K$. To address this challenge, we propose a randomized Frank-Wolfe algorithm with memoization and sparse updates that has a low $O(N^2 + K^2)$ per-iteration complexity. We analyze it and demonstrate its empirical superiority on synthetic and open-source NLP datasets.
APA
Thekumparampil, K.K., Hiranandani, G., Kalantari, K., Sabach, S. & Kveton, B. (2025). Comparing Few to Rank Many: Active Human Preference Learning Using Randomized Frank-Wolfe Method. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:59355-59376. Available from https://proceedings.mlr.press/v267/thekumparampil25a.html.
