Position: AI Evaluation Should Learn from How We Test Humans

Yan Zhuang, Qi Liu, Zachary Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:82483-82508, 2025.

Abstract

As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model’s evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model’s observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today’s AI evaluations.
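To make the adaptive-testing idea in the abstract concrete, below is a minimal illustrative sketch (not the paper's specific method) of item-response-theory (IRT) based adaptive evaluation under a two-parameter logistic (2PL) model: item parameters are assumed to have been calibrated in advance, the most informative remaining item is selected at each step, and the model's latent ability is re-estimated after each response. The `ask_model` callback and all parameter names are hypothetical.

```python
import numpy as np

# Sketch of IRT-based adaptive testing (2PL model). Item parameters
# (discrimination a[j], difficulty b[j]) are assumed pre-calibrated from
# existing model responses on the benchmark.

def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers item (a, b) correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Item information at the current ability estimate (higher = more informative)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, a, b, grid=np.linspace(-4, 4, 401)):
    """Grid-search MAP estimate of ability under a standard-normal prior."""
    log_post = -0.5 * grid ** 2  # log prior
    for j, y in responses:       # j = item index, y = 1 if correct else 0
        p = p_correct(grid, a[j], b[j])
        log_post += y * np.log(p) + (1 - y) * np.log(1.0 - p)
    return grid[np.argmax(log_post)]

def adaptive_test(ask_model, a, b, n_items=20):
    """Adaptively pick the most informative remaining item for the model under test."""
    responses, remaining, theta = [], set(range(len(a))), 0.0
    for _ in range(n_items):
        j = max(remaining, key=lambda k: fisher_information(theta, a[k], b[k]))
        remaining.remove(j)
        responses.append((j, ask_model(j)))      # ask_model returns 1 if correct, else 0
        theta = estimate_theta(responses, a, b)  # re-estimate latent ability
    return theta
```

In this sketch, each model sees a different, much smaller subset of the benchmark, and the returned ability estimate theta (rather than an average score over a fixed test set) serves as the comparison metric.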

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhuang25e,
  title     = {Position: {AI} Evaluation Should Learn from How We Test Humans},
  author    = {Zhuang, Yan and Liu, Qi and Pardos, Zachary and Kyllonen, Patrick C. and Zu, Jiyun and Huang, Zhenya and Wang, Shijin and Chen, Enhong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {82483--82508},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhuang25e/zhuang25e.pdf},
  url       = {https://proceedings.mlr.press/v267/zhuang25e.html},
  abstract  = {As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model’s evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model’s observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today’s AI evaluations.}
}
Endnote
%0 Conference Paper
%T Position: AI Evaluation Should Learn from How We Test Humans
%A Yan Zhuang
%A Qi Liu
%A Zachary Pardos
%A Patrick C. Kyllonen
%A Jiyun Zu
%A Zhenya Huang
%A Shijin Wang
%A Enhong Chen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhuang25e
%I PMLR
%P 82483--82508
%U https://proceedings.mlr.press/v267/zhuang25e.html
%V 267
%X As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to determine their capabilities, typically evaluating against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Position, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model’s evaluation instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model’s observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today’s AI evaluations.
APA
Zhuang, Y., Liu, Q., Pardos, Z., Kyllonen, P.C., Zu, J., Huang, Z., Wang, S. & Chen, E. (2025). Position: AI Evaluation Should Learn from How We Test Humans. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:82483-82508. Available from https://proceedings.mlr.press/v267/zhuang25e.html.