Potemkin Understanding in Large Language Models

Marina Mancoridis, Bec Weeks, Keyon Vafa, Sendhil Mullainathan
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:42857-42881, 2025.

Abstract

Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM’s capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs—such as AP exams—are also those used to test people. However, this raises an implication: such benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mancoridis25a,
  title     = {Potemkin Understanding in Large Language Models},
  author    = {Mancoridis, Marina and Weeks, Bec and Vafa, Keyon and Mullainathan, Sendhil},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {42857--42881},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mancoridis25a/mancoridis25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mancoridis25a.html}
}
Endnote
%0 Conference Paper
%T Potemkin Understanding in Large Language Models
%A Marina Mancoridis
%A Bec Weeks
%A Keyon Vafa
%A Sendhil Mullainathan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mancoridis25a
%I PMLR
%P 42857--42881
%U https://proceedings.mlr.press/v267/mancoridis25a.html
%V 267
APA
Mancoridis, M., Weeks, B., Vafa, K. & Mullainathan, S. (2025). Potemkin Understanding in Large Language Models. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:42857-42881. Available from https://proceedings.mlr.press/v267/mancoridis25a.html.