Position: AI Should Not Be An Imitation Game: Centaur Evaluations

Andreas Haupt, Erik Brynjolfsson
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:81526-81541, 2025.

Abstract

Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations commonly test systems in the absence of humans. This position paper argues that the machine learning community should increasingly use centaur evaluations, in which humans and AI jointly solve tasks. Centaur evaluations refocus machine learning development toward human augmentation instead of human replacement; they allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness; and they can be more challenging and realistic than existing evaluations. By shifting the focus from automation toward collaboration between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting machine learning systems.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-haupt25a,
  title = {Position: {AI} Should Not Be An Imitation Game: Centaur Evaluations},
  author = {Haupt, Andreas and Brynjolfsson, Erik},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {81526--81541},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/haupt25a/haupt25a.pdf},
  url = {https://proceedings.mlr.press/v267/haupt25a.html},
  abstract = {Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations commonly test systems in the absence of humans. This position paper argues that the machine learning community should increasingly use centaur evaluations, in which humans and AI jointly solve tasks. Centaur Evaluations refocus machine learning development toward human augmentation instead of human replacement, they allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness, and they can be more challenging and realistic than existing evaluations. By shifting the focus from automation toward collaboration between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting machine learning systems.}
}
Endnote
%0 Conference Paper
%T Position: AI Should Not Be An Imitation Game: Centaur Evaluations
%A Andreas Haupt
%A Erik Brynjolfsson
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-haupt25a
%I PMLR
%P 81526--81541
%U https://proceedings.mlr.press/v267/haupt25a.html
%V 267
%X Benchmarks and evaluations are central to machine learning methodology and direct research in the field. Current evaluations commonly test systems in the absence of humans. This position paper argues that the machine learning community should increasingly use centaur evaluations, in which humans and AI jointly solve tasks. Centaur Evaluations refocus machine learning development toward human augmentation instead of human replacement, they allow for direct evaluation of human-centered desiderata, such as interpretability and helpfulness, and they can be more challenging and realistic than existing evaluations. By shifting the focus from automation toward collaboration between humans and AI, centaur evaluations can drive progress toward more effective and human-augmenting machine learning systems.
APA
Haupt, A. & Brynjolfsson, E. (2025). Position: AI Should Not Be An Imitation Game: Centaur Evaluations. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:81526-81541. Available from https://proceedings.mlr.press/v267/haupt25a.html.
