Repeat After Me: Transformers are Better than State Space Models at Copying

Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:21502-21521, 2024.

Abstract

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as “generalized state space models” (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.
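To make the abstract's argument concrete, the sketch below illustrates the kind of synthetic string-copying task it describes, together with the basic information-theoretic intuition for why a fixed-size state is limiting: to copy every string of length L over an alphabet of size |Σ|, a model whose output depends on the input only through its latent state must distinguish |Σ|^L inputs, so it needs roughly L·log2|Σ| bits of state. This is a minimal illustration, not the authors' exact experimental setup; the alphabet, separator token, and prompt formatting are assumptions made here for clarity, and the paper's formal lower bound is more involved.

import math
import random
import string

# Hypothetical task setup (the paper's exact tokenization and separators may differ).
ALPHABET = list(string.ascii_lowercase)  # assumed token alphabet
COPY_TOKEN = "<copy>"                    # assumed separator signaling "now copy the string"

def make_copy_example(length: int, rng: random.Random) -> tuple[list[str], list[str]]:
    """Sample a random string; the prompt is the string plus a copy token,
    and the target is the string itself."""
    s = [rng.choice(ALPHABET) for _ in range(length)]
    return s + [COPY_TOKEN], s

def min_state_bits(length: int, alphabet_size: int) -> float:
    """Rough lower bound on the latent-state size (in bits) a fixed-state model
    needs to copy all strings of this length: it must separate
    alphabet_size ** length distinct inputs after reading them."""
    return length * math.log2(alphabet_size)

if __name__ == "__main__":
    rng = random.Random(0)
    prompt, target = make_copy_example(8, rng)
    print("prompt:", " ".join(prompt))
    print("target:", " ".join(target))
    # A GSSM with S bits of state cannot copy arbitrary strings once
    # length * log2(|alphabet|) exceeds S, whereas attention can read
    # the needed symbols directly from the context.
    print("state bits needed for length 512:", min_state_bits(512, len(ALPHABET)))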

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-jelassi24a,
  title     = {Repeat After Me: Transformers are Better than State Space Models at Copying},
  author    = {Jelassi, Samy and Brandfonbrener, David and Kakade, Sham M. and Malach, Eran},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {21502--21521},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/jelassi24a/jelassi24a.pdf},
  url       = {https://proceedings.mlr.press/v235/jelassi24a.html},
  abstract  = {Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.}
}
Endnote
%0 Conference Paper
%T Repeat After Me: Transformers are Better than State Space Models at Copying
%A Samy Jelassi
%A David Brandfonbrener
%A Sham M. Kakade
%A Eran Malach
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-jelassi24a
%I PMLR
%P 21502--21521
%U https://proceedings.mlr.press/v235/jelassi24a.html
%V 235
%X Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained large language models and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.
APA
Jelassi, S., Brandfonbrener, D., Kakade, S. M., & Malach, E. (2024). Repeat After Me: Transformers are Better than State Space Models at Copying. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:21502-21521. Available from https://proceedings.mlr.press/v235/jelassi24a.html.
