On the Generalization Ability of Next-Token-Prediction Pretraining
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34943-34975, 2025.
Abstract
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs are usually transformer decoder-only models (DOMs) that use Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical success, a theoretical understanding of how NTP pre-training affects a model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer, multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitatively by the number of token sequences $N$, the maximum sequence length $m$, and the number of parameters $\Theta$ in the transformer model. Additionally, experiments on public datasets verify our theoretical findings.
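For concreteness, the NTP pre-training objective referred to above is typically the empirical cross-entropy risk over the $N$ token sequences; the display below is a standard sketch of that objective, where the symbols $x_t^{(i)}$, $m_i$, and $p_\theta$ are illustrative and may differ from the paper's notation:

$$\widehat{\mathcal{L}}_{\mathrm{NTP}}(\theta) \;=\; -\frac{1}{N}\sum_{i=1}^{N}\frac{1}{m_i}\sum_{t=1}^{m_i-1}\log p_{\theta}\!\left(x_{t+1}^{(i)} \,\middle|\, x_{1}^{(i)},\dots,x_{t}^{(i)}\right),$$

where conditioning each prediction only on the preceding tokens $x_{\le t}^{(i)}$ is what the causal mask matrix enforces inside the decoder's self-attention, and the dependence between tokens within a sequence is the source of the non-i.i.d. structure addressed in the analysis.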