Looking Beyond the Top-1: Transformers Determine Top Tokens in Order

Daria Lioubashevski, Tomer M. Schlank, Gabriel Stanovsky, Ariel Goldstein
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:38029-38048, 2025.

Abstract

Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden layer embedding, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency and show how to exploit saturation events for better language modeling.
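The core measurement behind the paper can be illustrated with a small sketch: for each rank k, find the earliest layer after which the rank-k token in the model's per-layer output distribution never changes again (the "saturation event"). This assumes you already have, for each layer, the token ids ranked by probability, e.g. by projecting each hidden state through the unembedding matrix as in the logit-lens approach; the function and variable names below are illustrative, not from the paper.

```python
def saturation_layer(topk_per_layer, k):
    """Return the earliest layer index from which the rank-k token
    (0-based) stays fixed through the final layer, or None if the
    rank-k token never saturates before the last layer."""
    final_token = topk_per_layer[-1][k]
    sat = None
    for layer, topk in enumerate(topk_per_layer):
        if topk[k] == final_token:
            if sat is None:
                sat = layer  # candidate saturation point
        else:
            sat = None  # token changed again; reset the candidate
    return sat

# Toy example: 6 "layers", top-3 token ids per layer.
layers = [
    [7, 2, 9],
    [4, 7, 2],
    [4, 7, 2],
    [4, 1, 2],
    [4, 1, 8],
    [4, 1, 8],
]
events = [saturation_layer(layers, k) for k in range(3)]
print(events)  # → [1, 3, 4]
```

In this toy data the saturation layers come out ordered by rank (1, then 3, then 4), mirroring the paper's finding that the model fixes the top-1 token first, then the top-2, and so on.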

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-lioubashevski25a,
  title     = {Looking Beyond the Top-1: Transformers Determine Top Tokens in Order},
  author    = {Lioubashevski, Daria and Schlank, Tomer M. and Stanovsky, Gabriel and Goldstein, Ariel},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {38029--38048},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/lioubashevski25a/lioubashevski25a.pdf},
  url       = {https://proceedings.mlr.press/v267/lioubashevski25a.html},
  abstract  = {Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden layer embedding, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency and show how to exploit saturation events for better language modeling.}
}
Endnote
%0 Conference Paper
%T Looking Beyond the Top-1: Transformers Determine Top Tokens in Order
%A Daria Lioubashevski
%A Tomer M. Schlank
%A Gabriel Stanovsky
%A Ariel Goldstein
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-lioubashevski25a
%I PMLR
%P 38029--38048
%U https://proceedings.mlr.press/v267/lioubashevski25a.html
%V 267
%X Uncovering the inner mechanisms of Transformer models offers insights into how they process and represent information. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction remains fixed, known as the “saturation event”. We expand this concept to top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these events occur in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different variants, and even in untrained Transformers. We propose that these events reflect task transitions, where determining each token corresponds to a discrete task. We show that it is possible to predict the current task from hidden layer embedding, and demonstrate that we can cause the model to switch to the next task via intervention. Leveraging our findings, we introduce a token-level early-exit strategy, surpassing existing methods in balancing performance and efficiency and show how to exploit saturation events for better language modeling.
APA
Lioubashevski, D., Schlank, T.M., Stanovsky, G. & Goldstein, A. (2025). Looking Beyond the Top-1: Transformers Determine Top Tokens in Order. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:38029-38048. Available from https://proceedings.mlr.press/v267/lioubashevski25a.html.