Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers

Freya Behrens, Luca Biggio, Lenka Zdeborova
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:3500-3532, 2025.

Abstract

Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allows for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
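
To make the task concrete, below is a minimal sketch of the histogram task the abstract refers to: labelling each position of a sequence with the count of that position's token in the whole sequence. The vocabulary, sequence length, and helper name histogram_targets are illustrative assumptions, not details taken from the paper.

from collections import Counter
import random

def histogram_targets(tokens):
    # For each position, the target is the number of occurrences of that
    # position's token anywhere in the sequence.
    counts = Counter(tokens)
    return [counts[t] for t in tokens]

# Illustrative usage: a random sequence over a 5-symbol vocabulary.
vocab = list("abcde")
seq = [random.choice(vocab) for _ in range(10)]
print(seq)                     # e.g. ['a', 'c', 'a', 'b', ...]
print(histogram_targets(seq))  # e.g. [2, 1, 2, 1, ...]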

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-behrens25a,
  title     = {Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers},
  author    = {Behrens, Freya and Biggio, Luca and Zdeborova, Lenka},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {3500--3532},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/behrens25a/behrens25a.pdf},
  url       = {https://proceedings.mlr.press/v267/behrens25a.html},
  abstract  = {Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allow for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.}
}
Endnote
%0 Conference Paper
%T Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers
%A Freya Behrens
%A Luca Biggio
%A Lenka Zdeborova
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-behrens25a
%I PMLR
%P 3500--3532
%U https://proceedings.mlr.press/v267/behrens25a.html
%V 267
%X Next to scaling considerations, architectural design choices profoundly shape the solution space of transformers. In this work, we analyze the solutions simple transformer blocks implement when tackling the histogram task: counting items in sequences. Despite its simplicity, this task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity. We identify two theoretical counting strategies transformers adopt, relation-based and inventory-based counting, each defining distinct learning regimes for the task. These strategies dictate how functionality is distributed between attention and feed-forward layers. We further show that adding softmax and beginning-of-sequence tokens allow for more robustness when embedding dimensions are comparatively small. Empirical introspection of trained models closely confirms both the learning regimes of the various architectures and the formation of these strategies during training. We demonstrate how a basic task that requires only aggregation and selection is significantly impacted by minor design changes.
APA
Behrens, F., Biggio, L. & Zdeborova, L. (2025). Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:3500-3532. Available from https://proceedings.mlr.press/v267/behrens25a.html.