Investigating Non-Transitivity in LLM-as-a-Judge

Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:69583-69612, 2025.

Abstract

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% $\rightarrow$ 96.4% and 82.1% $\rightarrow$ 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
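The Bradley-Terry model mentioned above assigns each model a latent strength and predicts that model $i$ beats model $j$ with probability $p_i/(p_i+p_j)$. As a rough illustration of how such strengths can be fit from round-robin win counts, here is a minimal sketch of the standard MM (minorization-maximization) fitting procedure; this is a generic textbook algorithm, not the authors' implementation, and the example win matrix is invented:

```python
import numpy as np

def bradley_terry(wins, n_iter=1000, tol=1e-8):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j (diagonal is 0).
    Returns strengths normalized to sum to 1 (higher = stronger).
    """
    n = wins.shape[0]
    p = np.ones(n)                      # initial strengths
    games = wins + wins.T               # total comparisons per pair
    total_wins = wins.sum(axis=1)
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = total_wins / denom
        p_new /= p_new.sum()            # fix the scale (BT is scale-invariant)
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return p

# Hypothetical round-robin results among three models A, B, C:
# A beat B 8/10 times, B beat C 8/10, A beat C 9/10.
wins = np.array([[0, 8, 9],
                 [2, 0, 8],
                 [1, 2, 0]], dtype=float)
strengths = bradley_terry(wins)
print(strengths)  # A ranked above B, B above C
```

Ranking models by the fitted strengths aggregates all pairwise outcomes at once, which is what makes the resulting order insensitive to the choice of any single baseline.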

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-xu25w,
  title =     {Investigating Non-Transitivity in {LLM}-as-a-Judge},
  author =    {Xu, Yi and Ruis, Laura and Rockt\"{a}schel, Tim and Kirk, Robert},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages =     {69583--69612},
  year =      {2025},
  editor =    {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume =    {267},
  series =    {Proceedings of Machine Learning Research},
  month =     {13--19 Jul},
  publisher = {PMLR},
  pdf =       {https://raw.githubusercontent.com/mlresearch/v267/main/assets/xu25w/xu25w.pdf},
  url =       {https://proceedings.mlr.press/v267/xu25w.html},
  abstract =  {Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% $\rightarrow$ 96.4% and 82.1% $\rightarrow$ 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.}
}
Endnote
%0 Conference Paper
%T Investigating Non-Transitivity in LLM-as-a-Judge
%A Yi Xu
%A Laura Ruis
%A Tim Rocktäschel
%A Robert Kirk
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-xu25w
%I PMLR
%P 69583--69612
%U https://proceedings.mlr.press/v267/xu25w.html
%V 267
%X Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0% $\rightarrow$ 96.4% and 82.1% $\rightarrow$ 86.3% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
APA
Xu, Y., Ruis, L., Rocktäschel, T. & Kirk, R. (2025). Investigating Non-Transitivity in LLM-as-a-Judge. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:69583-69612. Available from https://proceedings.mlr.press/v267/xu25w.html.