Residual Matrix Transformers: Scaling the Size of the Residual Stream

Brian Mak; Jeffrey Flanigan

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:42712-42729, 2025.

Abstract

The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.
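
For readers unfamiliar with the term, the following is a minimal sketch of a classic outer product associative memory in the style the abstract cites (Kohonen, 1972; Anderson, 1972). It is not the authors' RMT layer: the function names, dimensions, and the use of orthonormal keys are assumptions chosen only to show how a value can be written into a matrix with an outer product and read back with a matrix-vector product.

```python
# Classic outer-product associative memory, for illustration only.
# This is NOT the RMT architecture from the paper; all names and sizes
# below are hypothetical.

def outer(value, key):
    """Return the outer product value * key^T as a list of rows."""
    return [[v * k for k in key] for v in value]

def store(memory, key, value):
    """Write a (key, value) pair by adding value * key^T into the memory."""
    update = outer(value, key)
    for memory_row, update_row in zip(memory, update):
        for j, u in enumerate(update_row):
            memory_row[j] += u

def retrieve(memory, key):
    """Read from the memory by computing memory @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]

d_value, d_key = 3, 2
memory = [[0.0] * d_key for _ in range(d_value)]

# Orthonormal keys (an assumption) make retrieval exact; correlated keys
# would produce cross-talk between stored values.
key_a, key_b = [1.0, 0.0], [0.0, 1.0]
value_a, value_b = [1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]

store(memory, key_a, value_a)
store(memory, key_b, value_b)

recovered_a = retrieve(memory, key_a)
recovered_b = retrieve(memory, key_b)
```

In this toy setting the memory has `d_value * d_key` entries, so its capacity can grow by enlarging the key dimension without changing the size of the stored values.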

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-mak25a,
  title = 	 {Residual Matrix Transformers: Scaling the Size of the Residual Stream},
  author =       {Mak, Brian and Flanigan, Jeffrey},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {42712--42729},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mak25a/mak25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/mak25a.html},
  abstract = 	 {The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.}
}

Endnote

%0 Conference Paper
%T Residual Matrix Transformers: Scaling the Size of the Residual Stream
%A Brian Mak
%A Jeffrey Flanigan
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mak25a
%I PMLR
%P 42712--42729
%U https://proceedings.mlr.press/v267/mak25a.html
%V 267
%X The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties.

APA

Mak, B. & Flanigan, J. (2025). Residual Matrix Transformers: Scaling the Size of the Residual Stream. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:42712-42729. Available from https://proceedings.mlr.press/v267/mak25a.html.
