Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs

Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno De Moraes Dumont, Sanmi Koyejo
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:20723-20747, 2025.

Abstract

Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving >90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview, the strongest evaluated model, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement final-answer ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
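
The abstract describes the variation protocol only as "programmatically perturbing variables and constants." As a concrete illustration, the minimal Python sketch below parameterizes one hypothetical quadratic-minimization problem and samples fresh constants, so every draw is a new surface form that demands the same reasoning; the function name, template, and parameter ranges are illustrative assumptions, not the authors' generators (those live in the linked repository).

    import random

    def make_variant(rng: random.Random) -> dict:
        """Sample one functional variant of a parameterized problem.

        The template is the expanded form of f(x) = a(x - b)^2 + c, so the
        minimum is always c: perturbing a, b, and c changes the surface form
        (and the boxed answer) without changing the difficulty or the
        solution path.
        """
        a = rng.randint(2, 9)    # perturbed leading constant
        b = rng.randint(2, 9)    # perturbed vertex location
        c = rng.randint(1, 20)   # perturbed minimum value = ground-truth answer
        statement = (
            f"Find the minimum value of f(x) = {a}x^2 - {2 * a * b}x "
            f"+ {a * b * b + c} over the reals."
        )
        return {"problem": statement, "answer": c}

    if __name__ == "__main__":
        rng = random.Random(0)
        for _ in range(3):
            print(make_variant(rng))  # an unlimited stream of unseen instances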

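Teacher-Forced Accuracy is characterized above only as a lightweight metric that "directly scores reasoning traces." One common realization of that idea is next-token accuracy over a reference solution under teacher forcing, sketched below under that assumption with a Hugging Face causal LM; the gpt2 checkpoint and the function name are illustrative stand-ins, not the paper's evaluated models or code.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def teacher_forced_accuracy(model, tokenizer, problem: str,
                                reference_solution: str) -> float:
        """Fraction of reference-solution tokens the model predicts correctly
        when conditioned (teacher-forced) on the ground-truth prefix."""
        prompt_ids = tokenizer(problem, return_tensors="pt").input_ids
        full_ids = tokenizer(problem + reference_solution,
                             return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits  # (1, seq_len, vocab)
        # The prediction for token t comes from the logits at position t-1.
        preds = logits[:, :-1, :].argmax(dim=-1)
        targets = full_ids[:, 1:]
        # Score only the solution tokens, not the problem statement; this
        # boundary index is approximate when tokenization merges across it.
        start = prompt_ids.shape[1] - 1
        correct = (preds[:, start:] == targets[:, start:]).float()
        return correct.mean().item()

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(teacher_forced_accuracy(lm, tok, "Problem: 2+2=?\nSolution: ", "4"))

Because the model is conditioned on the ground-truth prefix at every step, a metric of this shape needs no answer extraction or LLM judge, which is what would make it cheap to automate for natural-language proofs.
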
Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gulati25a,
  title = {Putnam-{AXIOM}: A Functional \& Static Benchmark for Measuring Higher Level Mathematical Reasoning in {LLM}s},
  author = {Gulati, Aryan and Miranda, Brando and Chen, Eric and Xia, Emily and Fronsdal, Kai and De Moraes Dumont, Bruno and Koyejo, Sanmi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {20723--20747},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gulati25a/gulati25a.pdf},
  url = {https://proceedings.mlr.press/v267/gulati25a.html},
  abstract = {Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving >90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview, the strongest evaluated model, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement final-answer ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.}
}
Endnote
%0 Conference Paper
%T Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
%A Aryan Gulati
%A Brando Miranda
%A Eric Chen
%A Emily Xia
%A Kai Fronsdal
%A Bruno De Moraes Dumont
%A Sanmi Koyejo
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gulati25a
%I PMLR
%P 20723--20747
%U https://proceedings.mlr.press/v267/gulati25a.html
%V 267
%X Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving >90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview, the strongest evaluated model, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement final-answer ("boxed") accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing the advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
APA
Gulati, A., Miranda, B., Chen, E., Xia, E., Fronsdal, K., De Moraes Dumont, B. & Koyejo, S. (2025). Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:20723-20747. Available from https://proceedings.mlr.press/v267/gulati25a.html.
