Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract)

Dhruv Rohatgi; Adam Block; Audrey Huang; Akshay Krishnamurthy; Dylan J. Foster

Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:4831-4837, 2025.

Abstract

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from \emph{error amplification}, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in \emph{well-specified} settings, and, indeed, a growing body of empirical work hypothesizes that \emph{misspecification}, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification—where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq{}1$—we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: \textbf{(1)} Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. \textbf{(2)} Next-token prediction can be made robust to achieve $C=\tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: \emph{any} next-token prediction-style objective must suffer $C=\Omega(H)$. \textbf{(3)} For the natural testbed of autoregressive \emph{linear} models, \emph{no computationally efficient algorithm} can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.
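As a reading aid, the approximation factor $C$ can be sketched as an agnostic (best-in-class) guarantee. The notation here is assumed for illustration and is not taken from the abstract: $\Pi$ is the model class, $\pi^\star$ the target distribution over length-$H$ sequences, $\hat{\pi}$ the learner's output, $D$ an unspecified divergence between sequence distributions, and $\varepsilon_{\mathrm{stat}}(n)$ a statistical error term that vanishes as the sample size $n$ grows. The abstract fixes neither the divergence nor the form of the additive term, so the paper's formal definition may differ.

```latex
% Hypothetical notation: D, \Pi, \pi^\star, \hat{\pi}, \varepsilon_{\mathrm{stat}} are assumptions.
\[
  D\bigl(\hat{\pi}, \pi^\star\bigr)
  \;\le\;
  C \cdot \min_{\pi \in \Pi} D\bigl(\pi, \pi^\star\bigr)
  \;+\; \varepsilon_{\mathrm{stat}}(n),
  \qquad C \ge 1 .
\]
% Regimes stated in the abstract, in terms of the sequence length H:
%   (1) information-theoretically:              C = O(1)
%   (2) robust next-token prediction:           C = \tilde{O}(H), and any
%       next-token prediction-style objective:  C = \Omega(H)
%   (3) autoregressive linear models: no computationally efficient algorithm
%       achieves C = e^{(\log H)^{1-\Omega(1)}}
```

Under this reading, the well-specified case corresponds to $\min_{\pi \in \Pi} D(\pi, \pi^\star) = 0$, where the factor $C$ multiplies zero; this is consistent with the abstract's remark that error amplification should not appear in well-specified settings.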

Cite this Paper

BibTeX

@InProceedings{pmlr-v291-rohatgi25a,
  title = 	 {Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract)},
  author =       {Rohatgi, Dhruv and Block, Adam and Huang, Audrey and Krishnamurthy, Akshay and Foster, Dylan J.},
  booktitle = 	 {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages = 	 {4831--4837},
  year = 	 {2025},
  editor = 	 {Haghtalab, Nika and Moitra, Ankur},
  volume = 	 {291},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {30 Jun--04 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v291/main/assets/rohatgi25a/rohatgi25a.pdf},
  url = 	 {https://proceedings.mlr.press/v291/rohatgi25a.html},
  abstract = 	 {Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from \emph{error amplification}, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in \emph{well-specified} settings, and, indeed, a growing body of empirical work hypothesizes that \emph{misspecification}, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification—where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq{}1$—we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: \textbf{(1)} Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. \textbf{(2)} Next-token prediction can be made robust to achieve $C=\tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: \emph{any} next-token prediction-style objective must suffer $C=\Omega(H)$. \textbf{(3)} For the natural testbed of autoregressive \emph{linear} models, \emph{no computationally efficient algorithm} can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.}
}

Endnote

%0 Conference Paper
%T Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract)
%A Dhruv Rohatgi
%A Adam Block
%A Audrey Huang
%A Akshay Krishnamurthy
%A Dylan J. Foster
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-rohatgi25a
%I PMLR
%P 4831--4837
%U https://proceedings.mlr.press/v291/rohatgi25a.html
%V 291
%X Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from \emph{error amplification}, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in \emph{well-specified} settings, and, indeed, a growing body of empirical work hypothesizes that \emph{misspecification}, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification—where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq{}1$—we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: \textbf{(1)} Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. \textbf{(2)} Next-token prediction can be made robust to achieve $C=\tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: \emph{any} next-token prediction-style objective must suffer $C=\Omega(H)$. \textbf{(3)} For the natural testbed of autoregressive \emph{linear} models, \emph{no computationally efficient algorithm} can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.

APA

Rohatgi, D., Block, A., Huang, A., Krishnamurthy, A. & Foster, D.J. (2025). Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract). Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:4831-4837. Available from https://proceedings.mlr.press/v291/rohatgi25a.html.
