Memory Layers at Scale

Vincent-Pierre Berges, Barlas Oguz, Daniel Haziza, Wen-Tau Yih, Luke Zettlemoyer, Gargi Ghosh
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:3831-3842, 2025.

Abstract

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
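As a concrete illustration of the mechanism the abstract describes, below is a minimal PyTorch sketch of a sparsely activated key-value memory layer. It is an assumption-laden toy, not the authors' implementation: the class name, sizes, and the brute-force scoring of every key are illustrative only (the paper's fully parallelizable implementation is designed to keep the lookup cheap even at 128B memory parameters, which the flat x @ keys.T scoring here would not).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SparseMemoryLayer(nn.Module):
        """Toy trainable key-value memory with sparse (top-k) activation."""

        def __init__(self, dim: int, num_keys: int, topk: int = 8):
            super().__init__()
            self.topk = topk
            # Trainable keys and values. Only `topk` value rows are read per query.
            self.keys = nn.Parameter(torch.randn(num_keys, dim) / dim ** 0.5)
            self.values = nn.EmbeddingBag(num_keys, dim, mode="sum")

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, dim) query vectors, e.g. hidden states fed to the layer.
            scores = x @ self.keys.t()                        # (batch, num_keys)
            topk_scores, topk_idx = scores.topk(self.topk, dim=-1)
            weights = F.softmax(topk_scores, dim=-1)          # normalize selected scores
            # Gather only the selected value rows and return their weighted sum.
            return self.values(topk_idx, per_sample_weights=weights)


    # Hypothetical usage: a 16K-slot memory queried by a batch of 4 hidden states.
    layer = SparseMemoryLayer(dim=64, num_keys=16384, topk=8)
    out = layer(torch.randn(4, 64))                           # shape: (4, 64)

Only topk value rows are read and mixed per query, so growing num_keys adds trainable capacity with little extra value-side compute; this is the parameters-without-FLOPs trade-off the abstract refers to, complementing the dense feed-forward layers that carry most of the computation.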

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-berges25a,
  title     = {Memory Layers at Scale},
  author    = {Berges, Vincent-Pierre and Oguz, Barlas and Haziza, Daniel and Yih, Wen-Tau and Zettlemoyer, Luke and Ghosh, Gargi},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {3831--3842},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/berges25a/berges25a.pdf},
  url       = {https://proceedings.mlr.press/v267/berges25a.html},
  abstract  = {Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.}
}
Endnote
%0 Conference Paper
%T Memory Layers at Scale
%A Vincent-Pierre Berges
%A Barlas Oguz
%A Daniel Haziza
%A Wen-Tau Yih
%A Luke Zettlemoyer
%A Gargi Ghosh
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-berges25a
%I PMLR
%P 3831--3842
%U https://proceedings.mlr.press/v267/berges25a.html
%V 267
%X Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
APA
Berges, V., Oguz, B., Haziza, D., Yih, W., Zettlemoyer, L. & Ghosh, G. (2025). Memory Layers at Scale. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:3831-3842. Available from https://proceedings.mlr.press/v267/berges25a.html.