Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers

Zhiyu Yao; Jian Wang; Haixu Wu; Jingdong Wang; Mingsheng Long

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:56914-56926, 2024.

Abstract

Vision Transformers (ViTs) excel in computer vision tasks due to their ability to capture global context among tokens. However, their quadratic complexity $\mathcal{O}(N^2D)$ in terms of token number $N$ and feature dimension $D$ limits practical use on mobile devices, necessitating more mobile-friendly ViTs with reduced latency. Multi-head linear-attention is emerging as a promising alternative with linear complexity $\mathcal{O}(NDd)$, where $d$ is the per-head dimension. Still, more compute is needed as $d$ gets large for model accuracy. Reducing $d$ improves mobile friendliness at the expense of excessive small heads weak at learning valuable subspaces, ultimately impeding model capability. To overcome this efficiency-capability dilemma, we propose a novel Mobile-Attention design with a head-competition mechanism empowered by information flow, which prevents overemphasis on less important subspaces upon trivial heads while preserving essential subspaces to ensure Transformer’s capability. It enables linear-time complexity on mobile devices by supporting a small per-head dimension $d$ for mobile efficiency. By replacing the standard attention of ViTs with Mobile-Attention, our optimized ViTs achieved enhanced model capacity and competitive performance in a range of computer vision tasks. Specifically, we have achieved remarkable reductions in latency on the iPhone 12. Code is available at https://github.com/thuml/MobileAttention.
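The complexity figures quoted in the abstract can be illustrated with a minimal sketch of generic kernelized linear attention. This is not the paper's Mobile-Attention or its head-competition mechanism. The feature map `feature_map`, the omission of normalization, and all function names below are assumptions made for illustration only. The sketch shows that reordering the matrix products replaces the N x N intermediate with a d x d one, and counts multiply-accumulates per layer under that assumption.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def feature_map(m):
    # Assumed positive feature map phi(x) = max(x, 0) + 1 (illustrative choice,
    # not taken from the paper).
    return [[max(x, 0.0) + 1.0 for x in row] for row in m]


def quadratic_order(q, k, v):
    # (phi(Q) phi(K)^T) V: builds an N x N score matrix first.
    scores = matmul(feature_map(q), transpose(feature_map(k)))
    return matmul(scores, v)


def linear_order(q, k, v):
    # phi(Q) (phi(K)^T V): builds a d x d summary first, so cost grows linearly in N.
    kv = matmul(transpose(feature_map(k)), v)
    return matmul(feature_map(q), kv)


def attention_macs(n_tokens, model_dim, head_dim):
    """Multiply-accumulate counts over all heads for the two orderings.

    Assumes model_dim is divisible by head_dim and ignores projections,
    normalization, and the feature map itself.
    """
    heads = model_dim // head_dim
    quadratic = heads * 2 * n_tokens * n_tokens * head_dim  # O(N^2 D)
    linear = heads * 2 * n_tokens * head_dim * head_dim     # O(N D d)
    return quadratic, linear


if __name__ == "__main__":
    q = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
    k = [[0.0, 1.0], [-1.0, 2.0], [3.0, -0.5]]
    v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(quadratic_order(q, k, v))
    print(linear_order(q, k, v))
    print(attention_macs(n_tokens=196, model_dim=384, head_dim=32))
```

Both orderings compute the same unnormalized output by associativity of matrix multiplication. Under these counts, the ratio of quadratic to linear cost is N/d, which is why a smaller per-head dimension $d$ lowers the cost of the linear form.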

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-yao24c,
  title = 	 {Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers},
  author =       {Yao, Zhiyu and Wang, Jian and Wu, Haixu and Wang, Jingdong and Long, Mingsheng},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {56914--56926},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/yao24c/yao24c.pdf},
  url = 	 {https://proceedings.mlr.press/v235/yao24c.html},
  abstract = 	 {Vision Transformers (ViTs) excel in computer vision tasks due to their ability to capture global context among tokens. However, their quadratic complexity $\mathcal{O}(N^2D)$ in terms of token number $N$ and feature dimension $D$ limits practical use on mobile devices, necessitating more mobile-friendly ViTs with reduced latency. Multi-head linear-attention is emerging as a promising alternative with linear complexity $\mathcal{O}(NDd)$, where $d$ is the per-head dimension. Still, more compute is needed as $d$ gets large for model accuracy. Reducing $d$ improves mobile friendliness at the expense of excessive small heads weak at learning valuable subspaces, ultimately impeding model capability. To overcome this efficiency-capability dilemma, we propose a novel Mobile-Attention design with a head-competition mechanism empowered by information flow, which prevents overemphasis on less important subspaces upon trivial heads while preserving essential subspaces to ensure Transformer’s capability. It enables linear-time complexity on mobile devices by supporting a small per-head dimension $d$ for mobile efficiency. By replacing the standard attention of ViTs with Mobile-Attention, our optimized ViTs achieved enhanced model capacity and competitive performance in a range of computer vision tasks. Specifically, we have achieved remarkable reductions in latency on the iPhone 12. Code is available at https://github.com/thuml/MobileAttention.}
}

Endnote

%0 Conference Paper
%T Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers
%A Zhiyu Yao
%A Jian Wang
%A Haixu Wu
%A Jingdong Wang
%A Mingsheng Long
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-yao24c
%I PMLR
%P 56914--56926
%U https://proceedings.mlr.press/v235/yao24c.html
%V 235
%X Vision Transformers (ViTs) excel in computer vision tasks due to their ability to capture global context among tokens. However, their quadratic complexity $\mathcal{O}(N^2D)$ in terms of token number $N$ and feature dimension $D$ limits practical use on mobile devices, necessitating more mobile-friendly ViTs with reduced latency. Multi-head linear-attention is emerging as a promising alternative with linear complexity $\mathcal{O}(NDd)$, where $d$ is the per-head dimension. Still, more compute is needed as $d$ gets large for model accuracy. Reducing $d$ improves mobile friendliness at the expense of excessive small heads weak at learning valuable subspaces, ultimately impeding model capability. To overcome this efficiency-capability dilemma, we propose a novel Mobile-Attention design with a head-competition mechanism empowered by information flow, which prevents overemphasis on less important subspaces upon trivial heads while preserving essential subspaces to ensure Transformer’s capability. It enables linear-time complexity on mobile devices by supporting a small per-head dimension $d$ for mobile efficiency. By replacing the standard attention of ViTs with Mobile-Attention, our optimized ViTs achieved enhanced model capacity and competitive performance in a range of computer vision tasks. Specifically, we have achieved remarkable reductions in latency on the iPhone 12. Code is available at https://github.com/thuml/MobileAttention.

APA


Yao, Z., Wang, J., Wu, H., Wang, J. & Long, M. (2024). Mobile Attention: Mobile-Friendly Linear-Attention for Vision Transformers. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:56914-56926. Available from https://proceedings.mlr.press/v235/yao24c.html.
