A Theory of Learning with Autoregressive Chain of Thought

Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro
Proceedings of Thirty Eighth Conference on Learning Theory, PMLR 291:3161-3212, 2025.

Abstract

For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
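As a concrete illustration of the setup in the abstract, here is a minimal sketch in Python (all names here, including Generator, chain_of_thought_answer, and the toy parity_step generator, are hypothetical and not from the paper): a fixed, time-invariant sequence-to-next-token generator f is iterated for T steps, each step appending one token to the sequence, and the final token is returned as the answer.

from typing import Callable, Sequence

# A base-class element: maps the sequence generated so far to the next token.
Generator = Callable[[Sequence[int]], int]

def chain_of_thought_answer(f: Generator, prompt: Sequence[int], T: int) -> int:
    """Iterate the fixed generator f for T steps starting from the prompt;
    the generated tokens form the chain of thought, and the answer is the
    final token."""
    seq = list(prompt)
    for _ in range(T):
        seq.append(f(seq))  # the same f at every step: time invariance
    return seq[-1]

# Hypothetical toy generator over {0, 1}: emit the parity of the tokens so far.
parity_step: Generator = lambda seq: sum(seq) % 2

print(chain_of_thought_answer(parity_step, prompt=[1, 0, 1], T=5))  # prints 0

In this picture, learning with the chain-of-thought observed means training examples include the full intermediate sequence, whereas end-to-end learning supplies only prompt-answer pairs, leaving the intermediate tokens latent; the abstract's length-independent sample complexity exploits the fact that the same f is reused at every step, so the bound need not grow with T.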

Cite this Paper


BibTeX
@InProceedings{pmlr-v291-joshi25a,
  title     = {A Theory of Learning with Autoregressive Chain of Thought},
  author    = {Joshi, Nirmit and Vardi, Gal and Block, Adam and Goel, Surbhi and Li, Zhiyuan and Misiakiewicz, Theodor and Srebro, Nathan},
  booktitle = {Proceedings of Thirty Eighth Conference on Learning Theory},
  pages     = {3161--3212},
  year      = {2025},
  editor    = {Haghtalab, Nika and Moitra, Ankur},
  volume    = {291},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--04 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v291/main/assets/joshi25a/joshi25a.pdf},
  url       = {https://proceedings.mlr.press/v291/joshi25a.html}
}
Endnote
%0 Conference Paper
%T A Theory of Learning with Autoregressive Chain of Thought
%A Nirmit Joshi
%A Gal Vardi
%A Adam Block
%A Surbhi Goel
%A Zhiyuan Li
%A Theodor Misiakiewicz
%A Nathan Srebro
%B Proceedings of Thirty Eighth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Nika Haghtalab
%E Ankur Moitra
%F pmlr-v291-joshi25a
%I PMLR
%P 3161--3212
%U https://proceedings.mlr.press/v291/joshi25a.html
%V 291
APA
Joshi, N., Vardi, G., Block, A., Goel, S., Li, Z., Misiakiewicz, T. & Srebro, N. (2025). A Theory of Learning with Autoregressive Chain of Thought. Proceedings of Thirty Eighth Conference on Learning Theory, in Proceedings of Machine Learning Research 291:3161-3212. Available from https://proceedings.mlr.press/v291/joshi25a.html.
