Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:39190-39218, 2025.

Abstract

The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as a "tie" between the two responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under ordinal feedback, generalizing binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition, which generalizes the existing assumption on binary feedback. The condition is validated via the sociological concept of the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, which may be of independent interest to another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on designing guidelines for human annotators. Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning in both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning.
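For context, a minimal sketch of the setup (our illustration based on the standard Bradley–Terry formulation; the paper's exact model may differ): with binary feedback, an annotator labels a comparison between two responses (y_1, y_2) to a prompt x with z \in \{0, 1\}, and the RM r is trained under the assumption

    P(z = 1 \mid x, y_1, y_2) = \sigma\big( r(x, y_1) - r(x, y_2) \big),

where \sigma is the sigmoid function. Ordinal feedback instead allows labels z \in [0, 1] of arbitrary granularity (e.g., 0, 1/2, 1 for "worse", "tied", "better"), and a marginal unbiasedness condition of the form

    E[z \mid x, y_1, y_2] = \sigma\big( r^*(x, y_1) - r^*(x, y_2) \big)

for the true reward r^* recovers the binary assumption as the special case z \in \{0, 1\}.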

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25az,
  title     = {Reward Modeling with Ordinal Feedback: Wisdom of the Crowd},
  author    = {Liu, Shang and Pan, Yu and Chen, Guanting and Li, Xiaocheng},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {39190--39218},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25az/liu25az.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25az.html}
}
Endnote
%0 Conference Paper
%T Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
%A Shang Liu
%A Yu Pan
%A Guanting Chen
%A Xiaocheng Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25az
%I PMLR
%P 39190--39218
%U https://proceedings.mlr.press/v267/liu25az.html
%V 267
APA
Liu, S., Pan, Y., Chen, G., & Li, X. (2025). Reward Modeling with Ordinal Feedback: Wisdom of the Crowd. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:39190-39218. Available from https://proceedings.mlr.press/v267/liu25az.html.
