Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Tan Minh Nguyen, Richard Baraniuk, Robert Kirby, Stanley Osher, Bao Wang
Proceedings of Mathematical and Scientific Machine Learning, PMLR 190:189-204, 2022.

Abstract

Transformers have achieved remarkable success in sequence modeling and beyond, but they suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Efficient transformers, leveraging techniques such as sparse and linear attention and hashing tricks, have been proposed to reduce this quadratic complexity, but they significantly degrade accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrates that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
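To make the mechanism described in the abstract concrete, the minimal sketch below (written for this page, not the authors' released implementation) contrasts standard causal linear attention with a momentum-augmented state update. In the gradient-descent reading, the running state update s_i = s_{i-1} + phi(k_i) v_i^T plays the role of a plain gradient step, and a heavy-ball-style accumulator gives the momentum variant. The feature map, the momentum coefficient gamma, and the step size beta are illustrative assumptions; the classical optimal heavy-ball momentum for a quadratic objective with curvature bounds mu <= L, namely ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))^2, is the quantity that motivates the adaptive strategy mentioned in the abstract.

# Minimal sketch: causal linear attention vs. a momentum-style state update.
# Illustrative only; feature map, gamma, and beta are assumptions, not the paper's exact choices.
import numpy as np

def feature_map(x):
    # elu(x) + 1, a common positive feature map used in linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    # Causal linear attention: the state update s_i = s_{i-1} + phi(k_i) v_i^T
    # can be read as a plain gradient step on a quadratic objective.
    Qf, Kf = feature_map(Q), feature_map(K)
    d, dv = Qf.shape[1], V.shape[1]
    s = np.zeros((d, dv))   # running sum of phi(k_i) v_i^T
    z = np.zeros(d)         # running sum of phi(k_i) for normalization
    out = np.zeros_like(V)
    for i in range(len(Qf)):
        s += np.outer(Kf[i], V[i])
        z += Kf[i]
        out[i] = (Qf[i] @ s) / (Qf[i] @ z + eps)
    return out

def momentum_linear_attention(Q, K, V, gamma=0.9, beta=1.0, eps=1e-6):
    # Same recurrence with a heavy-ball-style momentum accumulator:
    # p_i = gamma * p_{i-1} + phi(k_i) v_i^T,  s_i = s_{i-1} + beta * p_i.
    Qf, Kf = feature_map(Q), feature_map(K)
    d, dv = Qf.shape[1], V.shape[1]
    s = np.zeros((d, dv))
    p = np.zeros((d, dv))   # momentum accumulator
    z = np.zeros(d)
    out = np.zeros_like(V)
    for i in range(len(Qf)):
        p = gamma * p + np.outer(Kf[i], V[i])
        s = s + beta * p
        z += Kf[i]
        out[i] = (Qf[i] @ s) / (Qf[i] @ z + eps)
    return out

# Toy usage
rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 4))
K = rng.standard_normal((8, 4))
V = rng.standard_normal((8, 4))
print(momentum_linear_attention(Q, K, V).shape)  # (8, 4)

With gamma = 0 and beta = 1, momentum_linear_attention reduces to the plain linear attention recurrence; the only extra cost of the momentum variant is one additional d x d_v state per head, so memory and compute remain linear in the sequence length.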

Cite this Paper


BibTeX
@InProceedings{pmlr-v190-nguyen22a,
  title     = {Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization},
  author    = {Nguyen, Tan Minh and Baraniuk, Richard and Kirby, Robert and Osher, Stanley and Wang, Bao},
  booktitle = {Proceedings of Mathematical and Scientific Machine Learning},
  pages     = {189--204},
  year      = {2022},
  editor    = {Dong, Bin and Li, Qianxiao and Wang, Lei and Xu, Zhi-Qin John},
  volume    = {190},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--17 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v190/nguyen22a/nguyen22a.pdf},
  url       = {https://proceedings.mlr.press/v190/nguyen22a.html},
  abstract  = {Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.}
}
Endnote
%0 Conference Paper
%T Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization
%A Tan Minh Nguyen
%A Richard Baraniuk
%A Robert Kirby
%A Stanley Osher
%A Bao Wang
%B Proceedings of Mathematical and Scientific Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Bin Dong
%E Qianxiao Li
%E Lei Wang
%E Zhi-Qin John Xu
%F pmlr-v190-nguyen22a
%I PMLR
%P 189--204
%U https://proceedings.mlr.press/v190/nguyen22a.html
%V 190
%X Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear attention and hashing tricks; efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the \emph{momentum transformer}, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
APA
Nguyen, T.M., Baraniuk, R., Kirby, R., Osher, S. & Wang, B. (2022). Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization. Proceedings of Mathematical and Scientific Machine Learning, in Proceedings of Machine Learning Research 190:189-204. Available from https://proceedings.mlr.press/v190/nguyen22a.html.
