Active Evaluation Acquisition for Efficient LLM Benchmarking

Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, Graham Horwood
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:35581-35602, 2025.

Abstract

As large language models (LLMs) become increasingly versatile, numerous large-scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.
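
The acquisition loop the abstract describes can be sketched in a few lines. The toy example below is not the authors' implementation: it assumes binary per-prompt outcomes correlated through latent prompt features, uses a simple kernel smoother as a stand-in for the paper's learned dependency model, and uses a greedy most-uncertain-next heuristic in place of the RL-based policy. All names (Z, impute, budget, and so on) are illustrative. The point is only to show how observing a small acquired subset lets the remaining outcomes, and hence the overall benchmark score, be estimated.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: binary evaluation outcomes for N benchmark prompts,
# correlated through latent prompt features Z, so observing a few outcomes
# lets us impute the rest.
N, D = 200, 8
Z = rng.normal(size=(N, D))                                 # latent prompt features
w = rng.normal(size=D)
y = (Z @ w + 0.5 * rng.normal(size=N) > 0).astype(float)    # true outcomes

def rbf_weights(Z, idx, gamma=0.5):
    """Similarity of every prompt to the acquired prompts idx."""
    d2 = ((Z[:, None, :] - Z[None, idx, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def impute(y, idx):
    """Predict all outcomes from the observed subset idx (kernel smoother
    standing in for the paper's learned dependency model)."""
    W = rbf_weights(Z, idx)
    p = (W * y[idx]).sum(1) / W.sum(1)
    p[idx] = y[idx]                                         # observed outcomes are known
    return p

budget = 30
acquired = [int(rng.integers(N))]                           # seed with a random prompt
for _ in range(budget - 1):
    p = impute(y, np.array(acquired))
    u = p * (1 - p)                                         # predictive uncertainty
    u[acquired] = -1.0
    acquired.append(int(u.argmax()))                        # greedy heuristic, not the paper's RL policy

p = impute(y, np.array(acquired))
print(f"true accuracy: {y.mean():.3f}  estimated: {p.mean():.3f} "
      f"from {budget}/{N} evaluations")

Running the script prints the benchmark accuracy estimated from 30 acquired evaluations next to the true accuracy over all 200 prompts; the paper's contribution is learning both the dependency model and the selection policy rather than hand-coding them as here.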

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25bp,
  title     = {Active Evaluation Acquisition for Efficient {LLM} Benchmarking},
  author    = {Li, Yang and Ma, Jie and Ballesteros, Miguel and Benajiba, Yassine and Horwood, Graham},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {35581--35602},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25bp/li25bp.pdf},
  url       = {https://proceedings.mlr.press/v267/li25bp.html}
}
Endnote
%0 Conference Paper
%T Active Evaluation Acquisition for Efficient LLM Benchmarking
%A Yang Li
%A Jie Ma
%A Miguel Ballesteros
%A Yassine Benajiba
%A Graham Horwood
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25bp
%I PMLR
%P 35581--35602
%U https://proceedings.mlr.press/v267/li25bp.html
%V 267
APA
Li, Y., Ma, J., Ballesteros, M., Benajiba, Y. & Horwood, G. (2025). Active Evaluation Acquisition for Efficient LLM Benchmarking. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:35581-35602. Available from https://proceedings.mlr.press/v267/li25bp.html.