Position: Theory of Mind Benchmarks are Broken for Large Language Models

Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, Murray Campbell
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:82091-82130, 2025.

Abstract

Our paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. Thus, we introduce the concept of functional theory of mind: the ability to adapt to agents in-context following a rational response to their behavior. We find that many open source LLMs are capable of displaying strong literal theory of mind capabilities, but seem to struggle with functional theory of mind – even with exceedingly simple partner policies. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance or vice versa. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-riemer25a, title = {Position: Theory of Mind Benchmarks are Broken for Large Language Models}, author = {Riemer, Matthew and Ashktorab, Zahra and Bouneffouf, Djallel and Das, Payel and Liu, Miao and Weisz, Justin D. and Campbell, Murray}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {82091--82130}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/riemer25a/riemer25a.pdf}, url = {https://proceedings.mlr.press/v267/riemer25a.html}, abstract = {Our paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. Thus, we introduce the concept of functional theory of mind: the ability to adapt to agents in-context following a rational response to their behavior. We find that many open source LLMs are capable of displaying strong literal theory of mind capabilities, but seem to struggle with functional theory of mind – even with exceedingly simple partner policies. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance or vice versa. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.} }
Endnote
%0 Conference Paper %T Position: Theory of Mind Benchmarks are Broken for Large Language Models %A Matthew Riemer %A Zahra Ashktorab %A Djallel Bouneffouf %A Payel Das %A Miao Liu %A Justin D. Weisz %A Murray Campbell %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-riemer25a %I PMLR %P 82091--82130 %U https://proceedings.mlr.press/v267/riemer25a.html %V 267 %X Our paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans and fall victim to a fallacy of attributing human-like qualities to AI agents. We expect that humans will engage in a consistent reasoning process across various questions about a situation, but this is known to not be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. Thus, we introduce the concept of functional theory of mind: the ability to adapt to agents in-context following a rational response to their behavior. We find that many open source LLMs are capable of displaying strong literal theory of mind capabilities, but seem to struggle with functional theory of mind – even with exceedingly simple partner policies. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance or vice versa. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge deserving a prominent role in any meaningful LLM theory of mind evaluation.
APA
Riemer, M., Ashktorab, Z., Bouneffouf, D., Das, P., Liu, M., Weisz, J.D. & Campbell, M.. (2025). Position: Theory of Mind Benchmarks are Broken for Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:82091-82130 Available from https://proceedings.mlr.press/v267/riemer25a.html.

Related Material