CogMath: Assessing LLMs’ Authentic Mathematical Ability from a Human Cognitive Perspective

Jiayu Liu, Zhenya Huang, Wei Dai, Cheng Cheng, Jinze Wu, Jing Sha, Song Li, Qi Liu, Shijin Wang, Enhong Chen
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:38692-38707, 2025.

Abstract

Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs’ mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an “Inquiry-Judge-Reference” multi-agent system to generate inquiries that assess LLMs’ mastery along this dimension. An LLM is considered to truly master a problem only when it excels in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
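To make the strict mastery criterion concrete, below is a minimal, hypothetical Python sketch of how the per-problem aggregation could look. The agent callables (inquiry_agent, judge_agent, reference_agent, ask_model), the exact-match comparison, and the dimension labels are illustrative assumptions, not the authors' actual implementation or API.

    from typing import Callable, List

    # Illustrative labels for the 9 dimensions across the 3 stages;
    # the exact names are assumptions, not taken from the paper.
    DIMENSIONS: List[str] = [
        "comprehension/rephrasing",
        "comprehension/irrelevant-info",
        "comprehension/incomplete-info",
        "solving/numerical-calculation",
        "solving/knowledge",
        "solving/counterfactual",
        "summarization/backward-reasoning",
        "summarization/generalization",
        "summarization/self-explanation",
    ]

    def masters_problem(
        problem: str,
        ask_model: Callable[[str], str],           # the LLM under evaluation
        inquiry_agent: Callable[[str, str], str],  # (problem, dimension) -> inquiry
        judge_agent: Callable[[str, str], bool],   # (problem, inquiry) -> inquiry is valid?
        reference_agent: Callable[[str], str],     # inquiry -> reference answer
    ) -> bool:
        """A problem counts as truly mastered only if the evaluated LLM
        answers the valid inquiry from every one of the 9 dimensions correctly."""
        for dim in DIMENSIONS:
            inquiry = inquiry_agent(problem, dim)
            if not judge_agent(problem, inquiry):
                continue  # discard inquiries the Judge agent rejects
            # Exact-match comparison is a simplification for illustration.
            if ask_model(inquiry).strip() != reference_agent(inquiry).strip():
                return False  # one failed dimension is enough to deny mastery
        return True

    def authentic_accuracy(problems: List[str], **agents) -> float:
        """Strict accuracy under this criterion; per the paper, it falls
        30%-40% below plain answer accuracy for mainstream LLMs."""
        return sum(masters_problem(p, **agents) for p in problems) / len(problems)

The design choice to require success on all 9 dimensions (an AND over inquiries) is what drives the gap between this strict estimate and ordinary answer accuracy.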

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-liu25ab,
  title     = {{C}og{M}ath: Assessing {LLM}s’ Authentic Mathematical Ability from a Human Cognitive Perspective},
  author    = {Liu, Jiayu and Huang, Zhenya and Dai, Wei and Cheng, Cheng and Wu, Jinze and Sha, Jing and Li, Song and Liu, Qi and Wang, Shijin and Chen, Enhong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {38692--38707},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/liu25ab/liu25ab.pdf},
  url       = {https://proceedings.mlr.press/v267/liu25ab.html},
  abstract  = {Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs’ mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an “Inquiry-Judge-Reference” multi-agent system to generate inquiries that assess LLMs’ mastery along this dimension. An LLM is considered to truly master a problem only when it excels in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.}
}
Endnote
%0 Conference Paper
%T CogMath: Assessing LLMs’ Authentic Mathematical Ability from a Human Cognitive Perspective
%A Jiayu Liu
%A Zhenya Huang
%A Wei Dai
%A Cheng Cheng
%A Jinze Wu
%A Jing Sha
%A Song Li
%A Qi Liu
%A Shijin Wang
%A Enhong Chen
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-liu25ab
%I PMLR
%P 38692--38707
%U https://proceedings.mlr.press/v267/liu25ab.html
%V 267
%X Although large language models (LLMs) show promise in solving complex mathematical tasks, existing evaluation paradigms rely solely on a coarse measure of overall answer accuracy, which is insufficient for assessing their authentic capabilities. In this paper, we propose CogMath, which comprehensively assesses LLMs’ mathematical abilities through the lens of human cognition. Specifically, inspired by psychological theories, CogMath formalizes the human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. Within these stages, we investigate perspectives such as numerical calculation, knowledge, and counterfactuals, and design a total of 9 fine-grained evaluation dimensions. In each dimension, we develop an “Inquiry-Judge-Reference” multi-agent system to generate inquiries that assess LLMs’ mastery along this dimension. An LLM is considered to truly master a problem only when it excels in all inquiries from the 9 dimensions. By applying CogMath to three benchmarks, we reveal that the mathematical capabilities of 7 mainstream LLMs are overestimated by 30%-40%. Moreover, we locate their strengths and weaknesses across specific stages/dimensions, offering in-depth insights to further enhance their reasoning abilities.
APA
Liu, J., Huang, Z., Dai, W., Cheng, C., Wu, J., Sha, J., Li, S., Liu, Q., Wang, S. & Chen, E. (2025). CogMath: Assessing LLMs’ Authentic Mathematical Ability from a Human Cognitive Perspective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:38692-38707. Available from https://proceedings.mlr.press/v267/liu25ab.html.