From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers

Krzysztof Choromanski; Han Lin; Haoxian Chen; Tianyi Zhang; Arijit Sehanobish; Valerii Likhosherstov; Jack Parker-Holder; Tamas Sarlos; Adrian Weller; Thomas Weingarten

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3962-3983, 2022.

Abstract

In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques, ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.

Cite this Paper

BibTeX

@InProceedings{pmlr-v162-choromanski22a,
  title = 	 {From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers},
  author =       {Choromanski, Krzysztof and Lin, Han and Chen, Haoxian and Zhang, Tianyi and Sehanobish, Arijit and Likhosherstov, Valerii and Parker-Holder, Jack and Sarlos, Tamas and Weller, Adrian and Weingarten, Thomas},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {3962--3983},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/choromanski22a/choromanski22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/choromanski22a.html},
  abstract = 	 {In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.}
}

Endnote

%0 Conference Paper
%T From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers
%A Krzysztof Choromanski
%A Han Lin
%A Haoxian Chen
%A Tianyi Zhang
%A Arijit Sehanobish
%A Valerii Likhosherstov
%A Jack Parker-Holder
%A Tamas Sarlos
%A Adrian Weller
%A Thomas Weingarten
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-choromanski22a
%I PMLR
%P 3962--3983
%U https://proceedings.mlr.press/v162/choromanski22a.html
%V 162
%X In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.

APA

Choromanski, K., Lin, H., Chen, H., Zhang, T., Sehanobish, A., Likhosherstov, V., Parker-Holder, J., Sarlos, T., Weller, A., & Weingarten, T. (2022). From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3962-3983. Available from https://proceedings.mlr.press/v162/choromanski22a.html.
