On the Sample Complexity of Next-Token Prediction

Oğuz Kaan Yüksel, Nicolas Flammarion
Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, PMLR 258:694-702, 2025.

Abstract

Next-token prediction with cross-entropy loss is the objective of choice in sequence and language modeling. Despite its widespread use, there is a lack of theoretical analysis regarding the generalization of models trained using this objective. In this work, we provide an analysis of empirical risk minimization for sequential inputs generated by order-$k$ Markov chains. Assuming bounded and Lipschitz logit functions, our results show that in-sample prediction error decays optimally with the number of tokens, whereas out-of-sample error incurs an additional term related to the mixing properties of the Markov chain. These rates depend on the statistical complexity of the hypothesis class and can lead to generalization errors that do not scale exponentially with the order of the Markov chain—unlike classical $k$-gram estimators. Finally, we discuss the possibility of achieving generalization rates independent of mixing.
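To make the setting concrete, below is a minimal, illustrative sketch (not the authors' construction) of next-token prediction by empirical risk minimization with the cross-entropy loss on tokens drawn from an order-$k$ Markov chain. The model choice (linear logits over one-hot contexts), variable names, and hyperparameters are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    V, k, T = 5, 2, 2000  # vocabulary size, Markov order, number of tokens

    # Random order-k transition kernel: P(next token | last k tokens).
    P = rng.dirichlet(np.ones(V), size=V**k)

    def ctx_index(ctx):
        """Encode a length-k context as an integer in [0, V**k)."""
        idx = 0
        for s in ctx:
            idx = idx * V + s
        return int(idx)

    # Sample a trajectory of T tokens from the chain.
    seq = list(rng.integers(V, size=k))
    for _ in range(T - k):
        seq.append(rng.choice(V, p=P[ctx_index(seq[-k:])]))
    seq = np.array(seq)

    # Assumed hypothesis class: linear logit map over one-hot context features.
    W = np.zeros((V**k, V))

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    # Empirical risk minimization of the next-token cross-entropy by gradient descent.
    contexts = np.array([ctx_index(seq[t - k:t]) for t in range(k, T)])
    targets = seq[k:]
    lr = 0.5
    for _ in range(200):
        probs = softmax(W[contexts])                       # model's next-token probabilities
        grad = probs.copy()
        grad[np.arange(len(targets)), targets] -= 1.0      # gradient of cross-entropy w.r.t. logits
        np.add.at(W, contexts, -lr * grad / len(targets))  # accumulate per-context updates

    loss = -np.mean(np.log(softmax(W[contexts])[np.arange(len(targets)), targets]))
    print(f"empirical next-token cross-entropy: {loss:.3f}")

The paper's in-sample versus out-of-sample distinction corresponds to evaluating the fitted model on contexts from the training trajectory versus on fresh trajectories from the same chain, where mixing properties of the chain enter.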

Cite this Paper


BibTeX
@InProceedings{pmlr-v258-yuksel25a,
  title = {On the Sample Complexity of Next-Token Prediction},
  author = {Y{\"u}ksel, O{\u{g}}uz Kaan and Flammarion, Nicolas},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  pages = {694--702},
  year = {2025},
  editor = {Li, Yingzhen and Mandt, Stephan and Agrawal, Shipra and Khan, Emtiyaz},
  volume = {258},
  series = {Proceedings of Machine Learning Research},
  month = {03--05 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v258/main/assets/yuksel25a/yuksel25a.pdf},
  url = {https://proceedings.mlr.press/v258/yuksel25a.html},
  abstract = {Next-token prediction with cross-entropy loss is the objective of choice in sequence and language modeling. Despite its widespread use, there is a lack of theoretical analysis regarding the generalization of models trained using this objective. In this work, we provide an analysis of empirical risk minimization for sequential inputs generated by order-$k$ Markov chains. Assuming bounded and Lipschitz logit functions, our results show that in-sample prediction error decays optimally with the number of tokens, whereas out-of-sample error incurs an additional term related to the mixing properties of the Markov chain. These rates depend on the statistical complexity of the hypothesis class and can lead to generalization errors that do not scale exponentially with the order of the Markov chain---unlike classical $k$-gram estimators. Finally, we discuss the possibility of achieving generalization rates independent of mixing.}
}
Endnote
%0 Conference Paper
%T On the Sample Complexity of Next-Token Prediction
%A Oğuz Kaan Yüksel
%A Nicolas Flammarion
%B Proceedings of The 28th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2025
%E Yingzhen Li
%E Stephan Mandt
%E Shipra Agrawal
%E Emtiyaz Khan
%F pmlr-v258-yuksel25a
%I PMLR
%P 694--702
%U https://proceedings.mlr.press/v258/yuksel25a.html
%V 258
%X Next-token prediction with cross-entropy loss is the objective of choice in sequence and language modeling. Despite its widespread use, there is a lack of theoretical analysis regarding the generalization of models trained using this objective. In this work, we provide an analysis of empirical risk minimization for sequential inputs generated by order-$k$ Markov chains. Assuming bounded and Lipschitz logit functions, our results show that in-sample prediction error decays optimally with the number of tokens, whereas out-of-sample error incurs an additional term related to the mixing properties of the Markov chain. These rates depend on the statistical complexity of the hypothesis class and can lead to generalization errors that do not scale exponentially with the order of the Markov chain, unlike classical $k$-gram estimators. Finally, we discuss the possibility of achieving generalization rates independent of mixing.
APA
Yüksel, O. K., & Flammarion, N. (2025). On the Sample Complexity of Next-Token Prediction. Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 258:694-702. Available from https://proceedings.mlr.press/v258/yuksel25a.html.