Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao, Albert Gu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:10041-10071, 2024.

Abstract

While Transformers have been the main architecture behind deep learning’s success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8$\times$ faster, while continuing to be competitive with Transformers on language modeling.
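The duality the abstract refers to can be summarized as follows: applying a selective SSM recurrence over a sequence is equivalent to multiplying the input by a lower-triangular semiseparable matrix, which has the same shape as a masked attention matrix. The sketch below is a minimal NumPy illustration of that equivalence, not the authors' implementation; the scalar decay a, the dimensions T and N, and all variable names are assumptions made for exposition.

# Minimal sketch (assumed names/shapes, not the paper's API): a scalar-identity
# selective SSM computed (1) as a linear recurrence and (2) by materializing the
# equivalent lower-triangular 1-semiseparable, attention-like matrix.
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                      # sequence length, state dimension
a = rng.uniform(0.5, 1.0, T)     # per-step scalar decay A_t (scalar-identity case)
B = rng.standard_normal((T, N))  # input projections B_t
C = rng.standard_normal((T, N))  # output projections C_t
x = rng.standard_normal(T)       # scalar input channel

# (1) Recurrent (linear-time) form: h_t = a_t * h_{t-1} + B_t x_t,  y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# (2) Dual (quadratic) form: y = M x with M[j, i] = (C_j . B_i) * a_{i+1} ... a_j,
# a lower-triangular semiseparable matrix resembling a causally masked attention matrix.
M = np.zeros((T, T))
for j in range(T):
    for i in range(j + 1):
        decay = np.prod(a[i + 1 : j + 1])  # cumulative decay from step i to j (empty product = 1)
        M[j, i] = (C[j] @ B[i]) * decay
y_mat = M @ x

assert np.allclose(y_rec, y_mat)  # both views compute the same sequence transformation

In this toy example the recurrent view is linear in T while the materialized matrix view is quadratic; the SSD framework connects these two views through decompositions of the semiseparable matrix.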

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-dao24a,
  title     = {Transformers are {SSM}s: Generalized Models and Efficient Algorithms Through Structured State Space Duality},
  author    = {Dao, Tri and Gu, Albert},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {10041--10071},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/dao24a/dao24a.pdf},
  url       = {https://proceedings.mlr.press/v235/dao24a.html},
  abstract  = {While Transformers have been the main architecture behind deep learning’s success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8$\times$ faster, while continuing to be competitive with Transformers on language modeling.}
}
Endnote
%0 Conference Paper
%T Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
%A Tri Dao
%A Albert Gu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-dao24a
%I PMLR
%P 10041--10071
%U https://proceedings.mlr.press/v235/dao24a.html
%V 235
%X While Transformers have been the main architecture behind deep learning’s success in language modeling, state-space models (SSMs) such as Mamba have recently been shown to match or outperform Transformers at small to medium scale. We show that these families of models are actually quite closely related, and develop a rich framework of theoretical connections between SSMs and variants of attention, connected through various decompositions of a well-studied class of structured semiseparable matrices. Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba’s selective SSM that is 2-8$\times$ faster, while continuing to be competitive with Transformers on language modeling.
APA
Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:10041-10071. Available from https://proceedings.mlr.press/v235/dao24a.html.