Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael Jq Zhang, Zhilin Wang, Jena D. Hwang, Yi Dong, Olivier Delalleau, Yejin Choi, Eunsol Choi, Xiang Ren, Valentina Pyatkin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76193-76212, 2025.

Abstract

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
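For readers unfamiliar with the reward-modeling setup the abstract refers to, the standard Bradley-Terry objective fits a scalar reward model r_θ to pairwise comparisons by maximizing the likelihood that the chosen response outscores the rejected one. The formulation below is the commonly used one, written in our own notation as a sketch rather than taken from the paper:

\mathcal{L}_{\mathrm{BT}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \Big[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \Big]

Here x is the prompt, y_w and y_l are the preferred and rejected responses, and σ is the logistic sigmoid. Because this objective assigns a single scalar reward per response and assumes each comparison has one correct label, annotator disagreement can only be absorbed as noise, which is precisely the assumption the paper's taxonomy and experiments call into question.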

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25bx,
  title     = {Diverging Preferences: When do Annotators Disagree and do Models Know?},
  author    = {Zhang, Michael Jq and Wang, Zhilin and Hwang, Jena D. and Dong, Yi and Delalleau, Olivier and Choi, Yejin and Choi, Eunsol and Ren, Xiang and Pyatkin, Valentina},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76193--76212},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25bx/zhang25bx.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25bx.html},
  abstract  = {We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.}
}
Endnote
%0 Conference Paper
%T Diverging Preferences: When do Annotators Disagree and do Models Know?
%A Michael Jq Zhang
%A Zhilin Wang
%A Jena D. Hwang
%A Yi Dong
%A Olivier Delalleau
%A Yejin Choi
%A Eunsol Choi
%A Xiang Ren
%A Valentina Pyatkin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25bx
%I PMLR
%P 76193--76212
%U https://proceedings.mlr.press/v267/zhang25bx.html
%V 267
%X We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.
APA
Zhang, M. J., Wang, Z., Hwang, J. D., Dong, Y., Delalleau, O., Choi, Y., Choi, E., Ren, X., & Pyatkin, V. (2025). Diverging Preferences: When do Annotators Disagree and do Models Know? Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76193-76212. Available from https://proceedings.mlr.press/v267/zhang25bx.html.