STEER: Assessing the Economic Rationality of Large Language Models

Narun Krishnamurthi Raman, Taylor Lundy, Samuel Joseph Amouyal, Yoav Levine, Kevin Leyton-Brown, Moshe Tennenholtz
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42026-42047, 2024.

Abstract

There is increasing interest in using LLMs as decision-making "agents". Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions—and, more broadly, determining whether an LLM agent is reliable enough to be trusted—requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "rationality report card". Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
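The full scoring procedure is described in the paper itself; the sketch below is only meant to make the "rationality report card" idea concrete. It assumes, hypothetically, that the benchmark yields a per-element accuracy in [0, 1] and that the user's rubric is a set of weights, so that the report card reduces to a weighted average; the element names and numbers are invented for illustration.

# Minimal sketch (not the paper's implementation): aggregating hypothetical
# per-element benchmark scores into a rubric-weighted overall grade.

def report_card(scores, rubric):
    """Weighted average of per-element accuracies (each in [0, 1])."""
    total = sum(rubric.values())
    return sum(w * scores.get(element, 0.0) for element, w in rubric.items()) / total

# Hypothetical per-element accuracies for one model.
scores = {"expected_utility": 0.8, "bayesian_updating": 0.6, "arithmetic": 1.0}
# A user-provided rubric weighting the elements the user cares about most.
rubric = {"expected_utility": 2.0, "bayesian_updating": 1.0, "arithmetic": 1.0}

print(f"overall grade: {report_card(scores, rubric):.2f}")  # prints 0.80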

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-raman24b,
  title = {{STEER}: Assessing the Economic Rationality of Large Language Models},
  author = {Raman, Narun Krishnamurthi and Lundy, Taylor and Amouyal, Samuel Joseph and Levine, Yoav and Leyton-Brown, Kevin and Tennenholtz, Moshe},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {42026--42047},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/raman24b/raman24b.pdf},
  url = {https://proceedings.mlr.press/v235/raman24b.html},
  abstract = {There is increasing interest in using LLMs as decision-making "agents". Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions—and, more broadly, determining whether an LLM agent is reliable enough to be trusted—requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "rationality report card". Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.}
}
Endnote
%0 Conference Paper
%T STEER: Assessing the Economic Rationality of Large Language Models
%A Narun Krishnamurthi Raman
%A Taylor Lundy
%A Samuel Joseph Amouyal
%A Yoav Levine
%A Kevin Leyton-Brown
%A Moshe Tennenholtz
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-raman24b
%I PMLR
%P 42026--42047
%U https://proceedings.mlr.press/v235/raman24b.html
%V 235
%X There is increasing interest in using LLMs as decision-making "agents". Doing so involves many degrees of freedom: which model should be used; how should it be prompted; should it be asked to introspect, conduct chain-of-thought reasoning, etc.? Settling these questions—and, more broadly, determining whether an LLM agent is reliable enough to be trusted—requires a methodology for assessing such an agent's economic rationality. In this paper, we provide one. We begin by surveying the economic literature on rational decision making, taxonomizing a large set of fine-grained "elements" that an agent should exhibit, along with dependencies between them. We then propose a benchmark distribution that quantitatively scores an LLM's performance on these elements and, combined with a user-provided rubric, produces a "rationality report card". Finally, we describe the results of a large-scale empirical experiment with 14 different LLMs, characterizing both the current state of the art and the impact of different model sizes on models' ability to exhibit rational behavior.
APA
Raman, N. K., Lundy, T., Amouyal, S. J., Levine, Y., Leyton-Brown, K., & Tennenholtz, M. (2024). STEER: Assessing the Economic Rationality of Large Language Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:42026-42047. Available from https://proceedings.mlr.press/v235/raman24b.html.
