From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers

Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, Thomas Weingarten
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3962-3983, 2022.

Abstract

In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques, ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-choromanski22a,
  title     = {From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers},
  author    = {Choromanski, Krzysztof and Lin, Han and Chen, Haoxian and Zhang, Tianyi and Sehanobish, Arijit and Likhosherstov, Valerii and Parker-Holder, Jack and Sarlos, Tamas and Weller, Adrian and Weingarten, Thomas},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {3962--3983},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/choromanski22a/choromanski22a.pdf},
  url       = {https://proceedings.mlr.press/v162/choromanski22a.html},
  abstract  = {In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.}
}
Endnote
%0 Conference Paper
%T From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers
%A Krzysztof Choromanski
%A Han Lin
%A Haoxian Chen
%A Tianyi Zhang
%A Arijit Sehanobish
%A Valerii Likhosherstov
%A Jack Parker-Holder
%A Tamas Sarlos
%A Adrian Weller
%A Thomas Weingarten
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-choromanski22a
%I PMLR
%P 3962--3983
%U https://proceedings.mlr.press/v162/choromanski22a.html
%V 162
%X In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
APA
Choromanski, K., Lin, H., Chen, H., Zhang, T., Sehanobish, A., Likhosherstov, V., Parker-Holder, J., Sarlos, T., Weller, A., & Weingarten, T. (2022). From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3962-3983. Available from https://proceedings.mlr.press/v162/choromanski22a.html.
