$E(2)$-Equivariant Vision Transformer

Renjun Xu, Kaifan Yang, Ke Liu, Fengxiang He
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:2356-2366, 2023.

Abstract

Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in data. Initial attempts have been made on designing equivariant ViT but are proved defective in some cases in this paper. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments are conducted on standard benchmark datasets, demonstrating that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.
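
The abstract's central claim, that absolute positional encoding prevents self-attention from respecting the symmetries of the data, can be checked numerically. Below is a minimal sketch, assuming PyTorch; the grid size, token dimension, and 90-degree rotation test are illustrative choices and this is not the paper's GE-ViT positional encoding operator. Plain self-attention commutes with a rotation of the patch grid, because the rotation merely permutes tokens, whereas adding an absolute positional embedding breaks this property.

import torch
import torch.nn as nn

torch.manual_seed(0)
H = W = 4   # patch grid side length
D = 8       # token dimension

attn = nn.MultiheadAttention(embed_dim=D, num_heads=1, batch_first=True)
pos = torch.randn(1, H * W, D)   # stand-in for a learned absolute positional embedding

def rotate90(tokens):
    # Rotate the underlying H x W patch grid by 90 degrees, then re-flatten to tokens.
    grid = tokens.reshape(1, H, W, D)
    return torch.rot90(grid, k=1, dims=(1, 2)).reshape(1, H * W, D)

x = torch.randn(1, H * W, D)

with torch.no_grad():
    # Without positional encoding: self-attention is permutation-equivariant,
    # and a 90-degree rotation only permutes tokens, so it commutes with attention.
    y, _ = attn(x, x, x)
    y_rot, _ = attn(rotate90(x), rotate90(x), rotate90(x))
    print(torch.allclose(rotate90(y), y_rot, atol=1e-5))   # True

    # With an absolute positional embedding added, equivariance is lost:
    # the rotated tokens are paired with the unrotated positions.
    z = x + pos
    z_rot = rotate90(x) + pos
    y, _ = attn(z, z, z)
    y_rot, _ = attn(z_rot, z_rot, z_rot)
    print(torch.allclose(rotate90(y), y_rot, atol=1e-5))   # False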

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-xu23b,
  title     = {$E(2)$-Equivariant Vision Transformer},
  author    = {Xu, Renjun and Yang, Kaifan and Liu, Ke and He, Fengxiang},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {2356--2366},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/xu23b/xu23b.pdf},
  url       = {https://proceedings.mlr.press/v216/xu23b.html},
  abstract  = {Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in data. Initial attempts have been made on designing equivariant ViT but are proved defective in some cases in this paper. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments are conducted on standard benchmark datasets, demonstrating that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.}
}
Endnote
%0 Conference Paper
%T $E(2)$-Equivariant Vision Transformer
%A Renjun Xu
%A Kaifan Yang
%A Ke Liu
%A Fengxiang He
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-xu23b
%I PMLR
%P 2356--2366
%U https://proceedings.mlr.press/v216/xu23b.html
%V 216
%X Vision Transformer (ViT) has achieved remarkable performance in computer vision. However, positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in data. Initial attempts have been made on designing equivariant ViT but are proved defective in some cases in this paper. To address this issue, we design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator. We prove that GE-ViT meets all the theoretical requirements of an equivariant neural network. Comprehensive experiments are conducted on standard benchmark datasets, demonstrating that GE-ViT significantly outperforms non-equivariant self-attention networks. The code is available at https://github.com/ZJUCDSYangKaifan/GEVit.
APA
Xu, R., Yang, K., Liu, K. & He, F. (2023). $E(2)$-Equivariant Vision Transformer. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:2356-2366. Available from https://proceedings.mlr.press/v216/xu23b.html.