The Surprising Effectiveness of Test-Time Training for Few-Shot Learning

Ekin Akyürek, Mehul Damani, Adam Zweiger, Linlu Qiu, Han Guo, Jyothish Pari, Yoon Kim, Jacob Andreas
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:942-963, 2025.

Abstract

Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs’ reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
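For intuition, the mechanism the abstract describes can be sketched in a few lines of PyTorch: derive a loss from the in-context demonstration pairs, take a handful of gradient steps to temporarily update the model, predict on the test query, and then restore the original weights. Everything below (model interface, data format, hyperparameters) is an illustrative placeholder, not the paper's actual pipeline, which is more involved (e.g., parameter-efficient updates and augmented demonstration sets).

import copy
import torch
import torch.nn.functional as F

def test_time_train(model, demos, query, steps=8, lr=1e-4):
    """Temporarily adapt `model` on few-shot demonstrations, predict, then restore.

    `demos` is a list of (input_ids, target_ids) tensor pairs built from the
    in-context examples; `query` is the test input. All names and settings
    here are illustrative placeholders, not the paper's configuration.
    """
    saved = copy.deepcopy(model.state_dict())      # snapshot the original weights
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for inp, tgt in demos:                     # loss derived from the input data
            logits = model(inp)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), tgt.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    model.eval()
    with torch.no_grad():
        pred = model(query).argmax(dim=-1)         # predict with the adapted weights
    model.load_state_dict(saved)                   # revert: the update is temporary
    return pred

Restoring the saved state dict after prediction is what makes the update "temporary": each test task gets its own short-lived adaptation of the shared base model.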

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-akyurek25a,
  title     = {The Surprising Effectiveness of Test-Time Training for Few-Shot Learning},
  author    = {Aky\"{u}rek, Ekin and Damani, Mehul and Zweiger, Adam and Qiu, Linlu and Guo, Han and Pari, Jyothish and Kim, Yoon and Andreas, Jacob},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {942--963},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/akyurek25a/akyurek25a.pdf},
  url       = {https://proceedings.mlr.press/v267/akyurek25a.html},
  abstract  = {Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs’ reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.}
}
Endnote
%0 Conference Paper
%T The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
%A Ekin Akyürek
%A Mehul Damani
%A Adam Zweiger
%A Linlu Qiu
%A Han Guo
%A Jyothish Pari
%A Yoon Kim
%A Jacob Andreas
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-akyurek25a
%I PMLR
%P 942--963
%U https://proceedings.mlr.press/v267/akyurek25a.html
%V 267
%X Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks even when given a small number of in-context task examples. We investigate the effectiveness of test-time training (TTT)—temporarily updating model parameters during inference using a loss derived from input data—as a mechanism for improving LMs’ reasoning and few-shot learning capabilities. On the Abstraction and Reasoning Corpus (ARC), performing TTT with in-context examples yields up to $6\times$ higher accuracy compared to fine-tuned baselines—reaching $53.0\%$ on the public validation set with an 8B-parameter LM and $61.9\%$ when ensembled with program-synthesis methods, matching average human performance. On BIG-Bench Hard (BBH), TTT on in-context examples surpasses standard few-shot prompting in the $10$-shot setting by $7.3$ percentage points ($50.5\%$ to $57.8\%$). Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
APA
Akyürek, E., Damani, M., Zweiger, A., Qiu, L., Guo, H., Pari, J., Kim, Y. & Andreas, J. (2025). The Surprising Effectiveness of Test-Time Training for Few-Shot Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:942-963. Available from https://proceedings.mlr.press/v267/akyurek25a.html.
