Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:48919-48937, 2024.

Abstract

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people’s beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that – especially for cases where the cost of mistakes is high – more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-vafa24a,
  title     = {Do Large Language Models Perform the Way People Expect? {M}easuring the Human Generalization Function},
  author    = {Vafa, Keyon and Rambachan, Ashesh and Mullainathan, Sendhil},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {48919--48937},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/vafa24a/vafa24a.pdf},
  url       = {https://proceedings.mlr.press/v235/vafa24a.html},
  abstract  = {What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people’s beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that – especially for cases where the cost of mistakes is high – more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.}
}
Endnote
%0 Conference Paper
%T Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
%A Keyon Vafa
%A Ashesh Rambachan
%A Sendhil Mullainathan
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-vafa24a
%I PMLR
%P 48919--48937
%U https://proceedings.mlr.press/v235/vafa24a.html
%V 235
%X What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people’s beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that – especially for cases where the cost of mistakes is high – more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
APA
Vafa, K., Rambachan, A. & Mullainathan, S.. (2024). Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:48919-48937 Available from https://proceedings.mlr.press/v235/vafa24a.html.
