Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer, Laurence Aitchison, Desi R. Ivanova
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:81143-81184, 2025.

Abstract

Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
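To make the failure mode concrete, below is a minimal sketch (not taken from the paper; the benchmark size n = 30 and score k = 29 are invented, and the Wilson score interval and Beta-Binomial credible interval stand in for the kinds of frequentist and Bayesian alternatives the abstract alludes to). It compares the CLT-based ("Wald") interval against these two alternatives on a small benchmark.

import numpy as np
from scipy import stats

# Hypothetical small benchmark: n = 30 questions, k = 29 answered correctly.
# (These numbers are made up purely for illustration.)
n, k = 30, 29
p_hat = k / n
z = 1.96  # two-sided 95%

# 1) CLT-based ("Wald") interval -- the approach the paper argues against at small n.
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - z * se, p_hat + z * se)  # note: upper end exceeds 1 here

# 2) Wilson score interval -- a simple frequentist alternative with better small-n coverage.
wilson = stats.binomtest(k, n).proportion_ci(confidence_level=0.95, method="wilson")

# 3) Bayesian 95% credible interval under a uniform Beta(1, 1) prior:
#    the posterior over the true accuracy is Beta(k + 1, n - k + 1).
posterior = stats.beta(k + 1, n - k + 1)
bayes = posterior.ppf([0.025, 0.975])

print(f"Wald (CLT): ({wald[0]:.3f}, {wald[1]:.3f})")
print(f"Wilson:     ({wilson.low:.3f}, {wilson.high:.3f})")
print(f"Bayes:      ({bayes[0]:.3f}, {bayes[1]:.3f})")

With these made-up numbers the Wald interval spills past 1, while the Wilson and Bayesian intervals stay inside [0, 1] and extend further on the low side, illustrating how the CLT interval understates uncertainty near the boundary with few datapoints.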

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-bowyer25a,
  title     = {Position: Don’t Use the {CLT} in {LLM} Evals With Fewer Than a Few Hundred Datapoints},
  author    = {Bowyer, Sam and Aitchison, Laurence and Ivanova, Desi R.},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {81143--81184},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/bowyer25a/bowyer25a.pdf},
  url       = {https://proceedings.mlr.press/v267/bowyer25a.html},
  abstract  = {Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.}
}
Endnote
%0 Conference Paper
%T Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints
%A Sam Bowyer
%A Laurence Aitchison
%A Desi R. Ivanova
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-bowyer25a
%I PMLR
%P 81143--81184
%U https://proceedings.mlr.press/v267/bowyer25a.html
%V 267
%X Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
APA
Bowyer, S., Aitchison, L. & Ivanova, D. R. (2025). Position: Don’t Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:81143-81184. Available from https://proceedings.mlr.press/v267/bowyer25a.html.