The Lipschitz Constant of Self-Attention

Hyunjik Kim, George Papamakarios, Andriy Mnih
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5562-5571, 2021.

Abstract

Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for an unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
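
As a rough illustration of the contrast the abstract draws, the sketch below implements standard scaled dot-product self-attention alongside a simplified L2-distance-based variant with tied query/key projections. This is a minimal single-head sketch for intuition only: the function names, the NumPy setup, and the omission of the paper's constraints on the value map (which are needed for the actual Lipschitz guarantee) are assumptions, not the authors' implementation.

# Hedged sketch (not the authors' exact implementation): contrasts standard
# dot-product self-attention with an L2-distance-based alternative of the kind
# the paper proposes. Single head, no weight constraints on the value map.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_self_attention(X, Wq, Wk, Wv):
    # Standard scaled dot-product self-attention (single head).
    # Per the paper, this map is not Lipschitz on an unbounded input domain.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (N, N) pairwise dot products
    return softmax(scores, axis=-1) @ V

def l2_self_attention(X, Wq, Wv):
    # Simplified L2-distance-based self-attention: attention logits are
    # negative squared L2 distances between projected inputs, with queries
    # and keys sharing the projection Wq (tied weights).
    Q = X @ Wq
    d = Q.shape[-1]
    sq_dists = ((Q[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # (N, N)
    scores = -sq_dists / np.sqrt(d)
    return softmax(scores, axis=-1) @ (X @ Wv)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, D = 5, 8                                  # sequence length, model dim
    X = rng.normal(size=(N, D))
    Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
    print(dot_product_self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
    print(l2_self_attention(X, Wq, Wv).shape)               # (5, 8)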

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-kim21i,
  title     = {The Lipschitz Constant of Self-Attention},
  author    = {Kim, Hyunjik and Papamakarios, George and Mnih, Andriy},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5562--5571},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/kim21i/kim21i.pdf},
  url       = {https://proceedings.mlr.press/v139/kim21i.html},
  abstract  = {Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.}
}
Endnote
%0 Conference Paper
%T The Lipschitz Constant of Self-Attention
%A Hyunjik Kim
%A George Papamakarios
%A Andriy Mnih
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-kim21i
%I PMLR
%P 5562--5571
%U https://proceedings.mlr.press/v139/kim21i.html
%V 139
%X Lipschitz constants of neural networks have been explored in various contexts in deep learning, such as provable adversarial robustness, estimating Wasserstein distance, stabilising training of GANs, and formulating invertible neural networks. Such works have focused on bounding the Lipschitz constant of fully connected or convolutional networks, composed of linear maps and pointwise non-linearities. In this paper, we investigate the Lipschitz constant of self-attention, a non-linear neural network module widely used in sequence modelling. We prove that the standard dot-product self-attention is not Lipschitz for unbounded input domain, and propose an alternative L2 self-attention that is Lipschitz. We derive an upper bound on the Lipschitz constant of L2 self-attention and provide empirical evidence for its asymptotic tightness. To demonstrate the practical relevance of our theoretical work, we formulate invertible self-attention and use it in a Transformer-based architecture for a character-level language modelling task.
APA
Kim, H., Papamakarios, G. & Mnih, A. (2021). The Lipschitz Constant of Self-Attention. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5562-5571. Available from https://proceedings.mlr.press/v139/kim21i.html.