Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Nicholas J Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:82232-82251, 2025.

Abstract

The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-wallach25a,
  title     = {Position: Evaluating Generative {AI} Systems Is a Social Science Measurement Challenge},
  author    = {Wallach, Hanna and Desai, Meera and Cooper, A. Feder and Wang, Angelina and Atalla, Chad and Barocas, Solon and Blodgett, Su Lin and Chouldechova, Alexandra and Corvi, Emily and Dow, P. Alex and Garcia-Gathright, Jean and Olteanu, Alexandra and Pangakis, Nicholas J and Reed, Stefanie and Sheng, Emily and Vann, Dan and Vaughan, Jennifer Wortman and Vogel, Matthew and Washington, Hannah and Jacobs, Abigail Z.},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {82232--82251},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wallach25a/wallach25a.pdf},
  url       = {https://proceedings.mlr.press/v267/wallach25a.html}
}
Endnote
%0 Conference Paper
%T Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge
%A Hanna Wallach
%A Meera Desai
%A A. Feder Cooper
%A Angelina Wang
%A Chad Atalla
%A Solon Barocas
%A Su Lin Blodgett
%A Alexandra Chouldechova
%A Emily Corvi
%A P. Alex Dow
%A Jean Garcia-Gathright
%A Alexandra Olteanu
%A Nicholas J Pangakis
%A Stefanie Reed
%A Emily Sheng
%A Dan Vann
%A Jennifer Wortman Vaughan
%A Matthew Vogel
%A Hannah Washington
%A Abigail Z. Jacobs
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wallach25a
%I PMLR
%P 82232--82251
%U https://proceedings.mlr.press/v267/wallach25a.html
%V 267
APA
Wallach, H., Desai, M., Cooper, A.F., Wang, A., Atalla, C., Barocas, S., Blodgett, S.L., Chouldechova, A., Corvi, E., Dow, P.A., Garcia-Gathright, J., Olteanu, A., Pangakis, N.J., Reed, S., Sheng, E., Vann, D., Vaughan, J.W., Vogel, M., Washington, H., & Jacobs, A.Z. (2025). Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research, 267:82232-82251. Available from https://proceedings.mlr.press/v267/wallach25a.html.
