Universal Length Generalization with Turing Programs

Kaiying Hou, David Brandfonbrener, Sham M. Kakade, Samy Jelassi, Eran Malach
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23873-23893, 2025.

Abstract

Length generalization refers to the ability to extrapolate from short training sequences to long test sequences, and it remains a challenge for current large language models. While prior work has proposed architectural or data-format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and chain-of-thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication, and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
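
To make the idea concrete, below is a minimal Python sketch of what a Turing Program trace can look like. The machine (binary increment) and the exact serialization of tape, head, and state are illustrative assumptions, not the paper's encoding; the point is that each chain-of-thought step is the previous tape copied forward with one small local edit.

# A minimal sketch of the Turing Programs idea: serialize a Turing
# machine's run as a chain of scratchpad lines, each line being the
# full tape with the head position and state marked. The machine and
# trace format here are illustrative assumptions, not the paper's
# exact CoT encoding.

BLANK = "_"

# (state, read symbol) -> (next state, write symbol, head move)
INCREMENT_TM = {
    ("right", "0"): ("right", "0", +1),   # scan right to the end of the input
    ("right", "1"): ("right", "1", +1),
    ("right", BLANK): ("carry", BLANK, -1),
    ("carry", "1"): ("carry", "0", -1),   # 1 + carry = 0, propagate the carry
    ("carry", "0"): ("halt", "1", 0),     # 0 + carry = 1, done
    ("carry", BLANK): ("halt", "1", 0),   # carry past the leftmost bit
}

def turing_program_trace(tape, state="right", head=0, max_steps=100):
    """Run the TM and yield one line per step: the whole tape with the
    head and state marked, i.e., the scratchpad a model would be
    trained to copy forward with a one-cell modification."""
    tape = list(tape)
    for _ in range(max_steps):
        yield "".join(tape[:head]) + f"[{state}:{tape[head]}]" + "".join(tape[head + 1:])
        if state == "halt":
            return
        nstate, write, move = INCREMENT_TM[(state, tape[head])]
        tape[head] = write
        state, head = nstate, head + move
        if head < 0:                       # grow the tape with blanks as needed
            tape.insert(0, BLANK)
            head = 0
        elif head >= len(tape):
            tape.append(BLANK)

# Increments binary 1011 (eleven) to 1100 (twelve), printing every step.
for line in turing_program_trace("1011"):
    print(line)

Under this framing, length generalization reduces to the model reliably performing the same copy-with-one-edit operation at every step, an operation whose form does not depend on the input length.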

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hou25a,
  title     = {Universal Length Generalization with Turing Programs},
  author    = {Hou, Kaiying and Brandfonbrener, David and Kakade, Sham M. and Jelassi, Samy and Malach, Eran},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {23873--23893},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hou25a/hou25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hou25a.html}
}
Endnote
%0 Conference Paper
%T Universal Length Generalization with Turing Programs
%A Kaiying Hou
%A David Brandfonbrener
%A Sham M. Kakade
%A Samy Jelassi
%A Eran Malach
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hou25a
%I PMLR
%P 23873--23893
%U https://proceedings.mlr.press/v267/hou25a.html
%V 267
APA
Hou, K., Brandfonbrener, D., Kakade, S.M., Jelassi, S. & Malach, E. (2025). Universal Length Generalization with Turing Programs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:23873-23893. Available from https://proceedings.mlr.press/v267/hou25a.html.
