The Role of Sparsity for Length Generalization in LLMs

Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19809-19840, 2025.

Abstract

Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized transformer model, which generalizes attention heads, successfully length-generalizes on such tasks. As a bonus, our theoretical model lets us justify techniques that modify positional embeddings to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a “sparse” dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling that trains the transformer to predict the position IDs used in the position coupling approach. Predictive Position Coupling thereby broadens the array of tasks to which position coupling can successfully be applied to achieve length generalization.
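
As a rough illustration of the "sparse dependency" structure the abstract refers to, the sketch below samples toy sequences in which the final token is determined by a small, fixed number k of earlier positions. The sampler name and the specific planting rule (a sum modulo the vocabulary size) are hypothetical and chosen purely for illustration; they are not the paper's exact construction of k-sparse planted correlation distributions.

    import random

    def sample_k_sparse_sequence(length, vocab_size=16, k=3, seed=None):
        """Toy sampler: the last token is a simple function of k randomly
        chosen earlier positions (here, their sum modulo vocab_size).
        An illustrative stand-in, not the paper's actual definition."""
        rng = random.Random(seed)
        prefix = [rng.randrange(vocab_size) for _ in range(length - 1)]
        # Plant a sparse dependency: pick the k positions the answer depends on.
        support = rng.sample(range(length - 1), k)
        target = sum(prefix[i] for i in support) % vocab_size
        return prefix + [target], support

    if __name__ == "__main__":
        seq, support = sample_k_sparse_sequence(length=12, k=3, seed=0)
        print("sequence:", seq)
        print("last token depends only on positions:", support)

Under this kind of distribution, the length of the sequence can grow while the number of positions that actually matter for the prediction stays fixed at k, which is the regime in which the paper's analysis predicts length generalization.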

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-golowich25a,
  title     = {The Role of Sparsity for Length Generalization in {LLM}s},
  author    = {Golowich, Noah and Jelassi, Samy and Brandfonbrener, David and Kakade, Sham M. and Malach, Eran},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {19809--19840},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/golowich25a/golowich25a.pdf},
  url       = {https://proceedings.mlr.press/v267/golowich25a.html},
  abstract  = {Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a “sparse” dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization.}
}
Endnote
%0 Conference Paper
%T The Role of Sparsity for Length Generalization in LLMs
%A Noah Golowich
%A Samy Jelassi
%A David Brandfonbrener
%A Sham M. Kakade
%A Eran Malach
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-golowich25a
%I PMLR
%P 19809--19840
%U https://proceedings.mlr.press/v267/golowich25a.html
%V 267
%X Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such behavior of length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers which generalize attention heads successfully length-generalize on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a “sparse” dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization.
APA
Golowich, N., Jelassi, S., Brandfonbrener, D., Kakade, S.M. & Malach, E. (2025). The Role of Sparsity for Length Generalization in LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:19809-19840. Available from https://proceedings.mlr.press/v267/golowich25a.html.