Vision Transformer with High Spatial Structure Sensitivity

Zhiwei Xu
Proceedings of the 16th Asian Conference on Machine Learning, PMLR 260:735-749, 2025.

Abstract

The self-attention operation, the core operation of the vision transformer (VT), is position-independent, so VT relies on positional embedding to encode spatial information. However, we find that positional embedding plays only a limited role and that VT is largely insensitive to spatial structure. We demonstrate a significant gap between VT and convolutional neural networks (CNNs) in sensitivity to random block shuffling and masking, which indicates that VT does not learn the spatial structure of the target well and focuses too heavily on small-scale detail features. We argue that self-attention should encode spatial information through position-dependent operations rather than rely on positional embedding. We replace the linear projections of self-attention with convolutions, giving each feature point a regular receptive field, which significantly increases VT's sensitivity to spatial structure without sacrificing performance.
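
The sketch below (plain PyTorch, not the author's reference code) illustrates the two ideas described in the abstract under stated assumptions: a shuffle_blocks probe that randomly permutes non-overlapping image blocks to test sensitivity to spatial structure, and a ConvProjectionAttention module in which the linear Q/K/V projections of self-attention are replaced by convolutions over the 2D token grid, so each feature point is computed from a regular local receptive field. The class name, the 3x3 kernel size, the block size, and the head count are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn


def shuffle_blocks(img: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Randomly permute non-overlapping block x block patches of an image batch.
    A simple probe of sensitivity to spatial structure (assumed test setup)."""
    b, c, h, w = img.shape
    gh, gw = h // block, w // block
    x = img.reshape(b, c, gh, block, gw, block).permute(0, 2, 4, 1, 3, 5)
    x = x.reshape(b, gh * gw, c, block, block)
    x = x[:, torch.randperm(gh * gw)]                        # shuffle block order
    x = x.reshape(b, gh, gw, c, block, block).permute(0, 3, 1, 4, 2, 5)
    return x.reshape(b, c, h, w)


class ConvProjectionAttention(nn.Module):
    """Multi-head self-attention whose Q/K/V projections are 3x3 convolutions
    over the 2D token grid instead of per-token linear layers (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Position-dependent projection: each token's Q/K/V depends on its
        # 3x3 spatial neighbourhood, not only on its own embedding.
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=3, padding=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) patch tokens kept on their 2D grid.
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def to_heads(t):
            # (B, C, H, W) -> (B, heads, H*W, C // heads)
            return t.reshape(b, self.num_heads, c // self.num_heads, h * w).transpose(-1, -2)

        q, k, v = map(to_heads, (q, k, v))
        attn = (q @ k.transpose(-1, -2)) * self.scale         # (B, heads, N, N)
        out = attn.softmax(dim=-1) @ v                        # (B, heads, N, C // heads)
        out = out.transpose(-1, -2).reshape(b, c, h, w)
        return self.proj(out)


if __name__ == "__main__":
    # Example: a 14x14 grid of 192-dim tokens, as in a small ViT on 224x224 input.
    tokens = torch.randn(2, 192, 14, 14)
    print(ConvProjectionAttention(dim=192, num_heads=8)(tokens).shape)  # (2, 192, 14, 14)
    print(shuffle_blocks(torch.randn(2, 3, 224, 224)).shape)            # (2, 3, 224, 224)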

Cite this Paper


BibTeX
@InProceedings{pmlr-v260-xu25b,
  title     = {Vision Transformer with High Spatial Structure Sensitivity},
  author    = {Xu, Zhiwei},
  booktitle = {Proceedings of the 16th Asian Conference on Machine Learning},
  pages     = {735--749},
  year      = {2025},
  editor    = {Nguyen, Vu and Lin, Hsuan-Tien},
  volume    = {260},
  series    = {Proceedings of Machine Learning Research},
  month     = {05--08 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v260/main/assets/xu25b/xu25b.pdf},
  url       = {https://proceedings.mlr.press/v260/xu25b.html},
  abstract  = {Self-attention operation, the core operation of the vision transformer (VT), is position-independent. Therefore, VT uses positional embedding to encode spatial information. However, we found that the role of positional encoding is very limited, and VT is insensitive to spatial structure. We demonstrated a significant sensitivity gap to random block shuffling and masking between VT and convolutional neural network (CNN), which indicates that VT does not learn the spatial structure of the target well and focuses too much on small-scale detail features. We argue that self-attention should use position-dependent operations to encode spatial information instead of relying on positional embedding. We replace the linear projection of self-attention with convolution operation and use regular receptive field for each feature point, which significantly increases VT’s sensitivity to spatial structure without sacrificing performance.}
}
Endnote
%0 Conference Paper
%T Vision Transformer with High Spatial Structure Sensitivity
%A Zhiwei Xu
%B Proceedings of the 16th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Vu Nguyen
%E Hsuan-Tien Lin
%F pmlr-v260-xu25b
%I PMLR
%P 735--749
%U https://proceedings.mlr.press/v260/xu25b.html
%V 260
%X Self-attention operation, the core operation of the vision transformer (VT), is position-independent. Therefore, VT uses positional embedding to encode spatial information. However, we found that the role of positional encoding is very limited, and VT is insensitive to spatial structure. We demonstrated a significant sensitivity gap to random block shuffling and masking between VT and convolutional neural network (CNN), which indicates that VT does not learn the spatial structure of the target well and focuses too much on small-scale detail features. We argue that self-attention should use position-dependent operations to encode spatial information instead of relying on positional embedding. We replace the linear projection of self-attention with convolution operation and use regular receptive field for each feature point, which significantly increases VT’s sensitivity to spatial structure without sacrificing performance.
APA
Xu, Z. (2025). Vision Transformer with High Spatial Structure Sensitivity. Proceedings of the 16th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 260:735-749. Available from https://proceedings.mlr.press/v260/xu25b.html.
