On The Computational Complexity of Self-Attention

Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde
Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:597-619, 2023.

Abstract

Transformer architectures have led to remarkable progress in many state-of-the-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear time, at the cost of having an exponential dependence on the polynomial order.
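To make the two regimes described in the abstract concrete, the following is a minimal NumPy sketch (not the authors' code; the function names and the toy comparison at the end are ours). Exact softmax attention materializes the n x n score matrix, so it costs on the order of n^2 d time. A degree-p Taylor expansion of exp(q.k) factorizes into a feature map of dimension sum_{m<=p} d^m, which gives a running time linear in n at the price of a feature dimension exponential in p, mirroring the trade-off stated in the paper's upper bound.

# Sketch: exact (quadratic) dot-product self-attention vs. a degree-p Taylor
# approximation that runs in time linear in the sequence length n.
# Illustrative only; not the authors' implementation.
import numpy as np
from math import factorial


def softmax_attention(Q, K, V):
    """Exact dot-product self-attention.

    Materializes the full n x n score matrix, hence O(n^2 d) time and
    O(n^2) space -- the cost that the paper's SETH-based lower bounds
    show cannot be beaten in general, even approximately.
    """
    scores = Q @ K.T                                 # (n, n)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                               # (n, d_v)


def taylor_features(X, p):
    """Feature map phi(x) = [x^{tensor m} / sqrt(m!)] for m = 0..p, so that
    phi(q) . phi(k) = sum_{m<=p} (q.k)^m / m!, i.e. the degree-p Taylor
    series of exp(q.k). The feature dimension sum_{m<=p} d^m is
    exponential in the order p."""
    n, d = X.shape
    feats = [np.ones((n, 1))]                        # m = 0 term
    power = np.ones((n, 1))                          # running x^{tensor m}
    for m in range(1, p + 1):
        power = (power[:, :, None] * X[:, None, :]).reshape(n, -1)
        feats.append(power / np.sqrt(factorial(m)))
    return np.concatenate(feats, axis=1)             # (n, sum_m d^m)


def taylor_attention(Q, K, V, p=2):
    """Linear-in-n approximation of exp(q.k)-weighted attention.

    Summarizes keys and values once in O(n * D * d_v), where
    D = sum_{m<=p} d^m, then answers every query in O(D * d_v):
    linear in n but exponential in the Taylor order p. (For odd p the
    truncated series can go negative, unlike softmax weights.)
    """
    phi_Q, phi_K = taylor_features(Q, p), taylor_features(K, p)
    key_value = phi_K.T @ V                          # (D, d_v), one pass over keys
    key_sum = phi_K.sum(axis=0)                      # (D,)
    numer = phi_Q @ key_value                        # (n, d_v)
    denom = phi_Q @ key_sum                          # (n,)
    return numer / denom[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 256, 8
    Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
    exact = softmax_attention(Q, K, V)
    approx = taylor_attention(Q, K, V, p=3)
    print(np.max(np.abs(exact - approx)))            # small for small-norm inputs

The toy check at the bottom only illustrates that the approximation is close when the dot products are small; the paper's result is the asymptotic statement that the feature dimension, and hence the constant in the linear running time, grows exponentially with the Taylor order.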

Cite this Paper


BibTeX
@InProceedings{pmlr-v201-duman-keles23a,
  title     = {On The Computational Complexity of Self-Attention},
  author    = {Duman Keles, Feyza and Wijewardena, Pruthuvi Mahesakya and Hegde, Chinmay},
  booktitle = {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages     = {597--619},
  year      = {2023},
  editor    = {Agrawal, Shipra and Orabona, Francesco},
  volume    = {201},
  series    = {Proceedings of Machine Learning Research},
  month     = {20 Feb--23 Feb},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v201/duman-keles23a/duman-keles23a.pdf},
  url       = {https://proceedings.mlr.press/v201/duman-keles23a.html},
  abstract  = {Transformer architectures have led to remarkable progress in many state-of-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear-time, at the cost of having an exponential dependence on the polynomial order.}
}
Endnote
%0 Conference Paper
%T On The Computational Complexity of Self-Attention
%A Feyza Duman Keles
%A Pruthuvi Mahesakya Wijewardena
%A Chinmay Hegde
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona
%F pmlr-v201-duman-keles23a
%I PMLR
%P 597--619
%U https://proceedings.mlr.press/v201/duman-keles23a.html
%V 201
%X Transformer architectures have led to remarkable progress in many state-of-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the length of the input. Several approaches have been proposed to speed up self-attention mechanisms to achieve sub-quadratic running time; however, the large majority of these works are not accompanied by rigorous error guarantees. In this work, we establish lower bounds on the computational complexity of self-attention in a number of scenarios. We prove that the time complexity of self-attention is necessarily quadratic in the input length, unless the Strong Exponential Time Hypothesis (SETH) is false. This argument holds even if the attention computation is performed only approximately, and for a variety of attention mechanisms. As a complement to our lower bounds, we show that it is indeed possible to approximate dot-product self-attention using finite Taylor series in linear-time, at the cost of having an exponential dependence on the polynomial order.
APA
Duman Keles, F., Wijewardena, P.M. & Hegde, C. (2023). On The Computational Complexity of Self-Attention. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:597-619. Available from https://proceedings.mlr.press/v201/duman-keles23a.html.