Tuning LLM Judge Design Decisions for 1/1000 of the Cost

David Salinas, Omar Swelam, Frank Hutter
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:52728-52744, 2025.

Abstract

Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed: a judge compares the outputs of two LLMs, enabling models to be ranked without human intervention. While several such approaches have been proposed, comparisons across papers are confounded by many factors; for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which finds judges that trade off accuracy against cost and also significantly reduces the cost of the search itself. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also use open-weight models, ensuring greater accessibility and reproducibility.
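
To illustrate the kind of search the abstract describes, the sketch below runs a multi-objective, multi-fidelity selection over a small space of judge configurations: each configuration is first scored cheaply on a small number of annotated examples, and only the Pareto-optimal configurations (highest human agreement at lowest cost) survive to a higher-fidelity evaluation. This is a minimal sketch under stated assumptions, not the authors' implementation; the configuration fields, the relative costs, and the judge_agrees stub are hypothetical placeholders standing in for real judge prompts and LLM calls.

    # Minimal sketch of multi-objective, multi-fidelity selection of LLM-judge
    # configurations. Hypothetical placeholders: JudgeConfig fields, COST_PER_CALL
    # values, and judge_agrees (which would call the judge LLM and compare its
    # verdict against a human annotation).
    import itertools
    import random
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class JudgeConfig:
        model: str          # e.g. an open-weight model name
        prompt: str         # prompt-template identifier
        temperature: float

    def judge_agrees(cfg: JudgeConfig, example_id: int) -> bool:
        # Placeholder for one judge call on one output pair; here a seeded
        # pseudo-random verdict stands in for agreement with the human label.
        rng = random.Random(hash((cfg, example_id)))
        return rng.random() < 0.7

    COST_PER_CALL = {"small-model": 1.0, "large-model": 10.0}  # hypothetical relative costs

    def evaluate(cfg: JudgeConfig, n_examples: int) -> tuple[float, float]:
        # Fidelity = number of annotated examples the judge is scored on.
        agreement = sum(judge_agrees(cfg, i) for i in range(n_examples)) / n_examples
        return agreement, COST_PER_CALL[cfg.model] * n_examples

    def pareto_front(results: dict[JudgeConfig, tuple[float, float]]) -> list[JudgeConfig]:
        # Keep configurations not dominated on (higher agreement, lower cost).
        front = []
        for cfg, (acc, cost) in results.items():
            dominated = any(
                a >= acc and c <= cost and (a, c) != (acc, cost)
                for a, c in results.values()
            )
            if not dominated:
                front.append(cfg)
        return front

    # Candidate search space (hypothetical values).
    configs = [
        JudgeConfig(m, p, t)
        for m, p, t in itertools.product(
            ["small-model", "large-model"], ["plain", "chain-of-thought"], [0.0, 0.7]
        )
    ]

    # Multi-fidelity loop: evaluate cheaply first, keep the Pareto front, and
    # spend higher-fidelity evaluations only on the surviving configurations.
    survivors = configs
    for fidelity in (50, 500):
        results = {cfg: evaluate(cfg, fidelity) for cfg in survivors}
        survivors = pareto_front(results)

    for cfg in survivors:
        print(cfg, results[cfg])

The point of the two-stage loop is the cost argument from the abstract: most configurations are discarded after the cheap, low-fidelity pass, so the expensive evaluations are spent only on judges that already look Pareto-optimal on accuracy versus cost.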

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-salinas25a,
  title     = {Tuning {LLM} Judge Design Decisions for 1/1000 of the Cost},
  author    = {Salinas, David and Swelam, Omar and Hutter, Frank},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {52728--52744},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/salinas25a/salinas25a.pdf},
  url       = {https://proceedings.mlr.press/v267/salinas25a.html}
}
Endnote
%0 Conference Paper
%T Tuning LLM Judge Design Decisions for 1/1000 of the Cost
%A David Salinas
%A Omar Swelam
%A Frank Hutter
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-salinas25a
%I PMLR
%P 52728--52744
%U https://proceedings.mlr.press/v267/salinas25a.html
%V 267
APA
Salinas, D., Swelam, O., & Hutter, F. (2025). Tuning LLM Judge Design Decisions for 1/1000 of the Cost. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:52728-52744. Available from https://proceedings.mlr.press/v267/salinas25a.html.
