Softmax is not Enough (for Sharp Size Generalisation)

Petar Veličković, Christos Perivolaropoulos, Federico Barbero, Razvan Pascanu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:61190-61211, 2025.

Abstract

A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
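The dispersion claim in the abstract is easy to verify numerically. Below is a minimal sketch (our illustration, not the paper's code; the `softmax` helper and the fixed low temperature of 0.05 are assumptions made for demonstration): with one maximal key holding a fixed logit gap of 2 over n−1 distractors, the attention weight on the maximum is e²/(e² + n − 1), which shrinks toward 0 as n grows, while lowering the temperature at inference time re-sharpens the distribution.

```python
import numpy as np

def softmax(logits, temp=1.0):
    """Standard softmax with an optional temperature parameter."""
    z = np.asarray(logits, dtype=float) / temp
    z = z - z.max()          # stabilise against overflow
    e = np.exp(z)
    return e / e.sum()

# One "maximal" key with a fixed logit gap of 2 over n - 1 distractors:
# the weight on the maximum is e^2 / (e^2 + n - 1), which tends to 0
# as n grows -- any learned sharp circuit must disperse at test time.
for n in [10, 100, 1000, 10000]:
    logits = np.zeros(n)
    logits[0] = 2.0
    print(n, softmax(logits)[0])

# A crude fixed-temperature stand-in for the paper's adaptive-temperature
# idea: dividing logits by a small temperature re-sharpens the
# distribution even for large n.
print(softmax(np.r_[2.0, np.zeros(9999)], temp=0.05)[0])
```

Note that the paper proposes *adaptive* temperature chosen at inference time; the hard-coded 0.05 above is only meant to show the sharpening effect.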

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-velickovic25a,
  title     = {Softmax is not Enough (for Sharp Size Generalisation)},
  author    = {Veli\v{c}kovi\'{c}, Petar and Perivolaropoulos, Christos and Barbero, Federico and Pascanu, Razvan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {61190--61211},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/velickovic25a/velickovic25a.pdf},
  url       = {https://proceedings.mlr.press/v267/velickovic25a.html},
  abstract  = {A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.}
}
Endnote
%0 Conference Paper
%T Softmax is not Enough (for Sharp Size Generalisation)
%A Petar Veličković
%A Christos Perivolaropoulos
%A Federico Barbero
%A Razvan Pascanu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-velickovic25a
%I PMLR
%P 61190--61211
%U https://proceedings.mlr.press/v267/velickovic25a.html
%V 267
%X A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions with increasing problem size, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
APA
Veličković, P., Perivolaropoulos, C., Barbero, F. & Pascanu, R. (2025). Softmax is not Enough (for Sharp Size Generalisation). Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:61190-61211. Available from https://proceedings.mlr.press/v267/velickovic25a.html.