In-Context Learning and Occam’s Razor

Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:15296-15319, 2025.

Abstract

A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
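For readers skimming the abstract, the core identity can be sketched in a few lines. The notation below (a sequence model q_phi, a dataset D = (x_1, ..., x_T), and a two-part code over models theta) is our own shorthand rather than the paper's, and the precise statement with its assumptions is given in the paper itself.

% Prequential (predictive sequential) code length of a dataset D = (x_1, ..., x_T)
% under a sequence model q_phi: each token is encoded using the model's prediction
% given the tokens seen so far. In nats, this is exactly the summed next-token
% prediction loss that in-context learners are trained to minimize.
\begin{align*}
  L_{\mathrm{preq}}(D) &= \sum_{t=1}^{T} -\log q_\phi(x_t \mid x_{<t}) .
\end{align*}

% Minimum-description-length reading (sketch): a short prequential code is close to
% the best two-part code, which charges for the complexity of the model implicitly
% fit from context plus that model's residual error on the data.
\begin{align*}
  L_{\mathrm{preq}}(D) &\approx \min_{\theta}\Big[\;\underbrace{L(\theta)}_{\text{model complexity}} \;+\; \underbrace{-\log p_\theta(D)}_{\text{training error}}\;\Big] .
\end{align*}

On this reading, driving down the next-token prediction loss on the left approximately drives down the sum of training error and model complexity on the right, which is the Occam's razor interpretation of in-context learning described in the abstract.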

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-elmoznino25b,
  title     = {In-Context Learning and Occam’s Razor},
  author    = {Elmoznino, Eric and Marty, Tom and Kasetty, Tejas and Gagnon, Leo and Mittal, Sarthak and Fathi, Mahan and Sridhar, Dhanya and Lajoie, Guillaume},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {15296--15319},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/elmoznino25b/elmoznino25b.pdf},
  url       = {https://proceedings.mlr.press/v267/elmoznino25b.html},
  abstract  = {A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.}
}
Endnote
%0 Conference Paper
%T In-Context Learning and Occam’s Razor
%A Eric Elmoznino
%A Tom Marty
%A Tejas Kasetty
%A Leo Gagnon
%A Sarthak Mittal
%A Mahan Fathi
%A Dhanya Sridhar
%A Guillaume Lajoie
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-elmoznino25b
%I PMLR
%P 15296--15319
%U https://proceedings.mlr.press/v267/elmoznino25b.html
%V 267
%X A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
APA
Elmoznino, E., Marty, T., Kasetty, T., Gagnon, L., Mittal, S., Fathi, M., Sridhar, D. & Lajoie, G. (2025). In-Context Learning and Occam’s Razor. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:15296-15319. Available from https://proceedings.mlr.press/v267/elmoznino25b.html.