Spatial-Channel Token Distillation for Vision MLPs

Yanxi Li; Xinghao Chen; Minjing Dong; Yehui Tang; Yunhe Wang; Chang Xu

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:12685-12695, 2022.

Abstract

Recently, neural architectures with all Multi-layer Perceptrons (MLPs) have attracted great research interest from the computer vision community. However, the inefficient mixing of spatial-channel information causes MLP-like vision models to demand tremendous pre-training on large-scale datasets. This work solves the problem from a novel knowledge distillation perspective. We propose a novel Spatial-channel Token Distillation (STD) method, which improves the information mixing in the two dimensions by introducing distillation tokens to each of them. A mutual information regularization is further introduced to let distillation tokens focus on their specific dimensions and maximize the performance gain. Extensive experiments on ImageNet for several MLP-like architectures demonstrate that the proposed token distillation mechanism can efficiently improve the accuracy. For example, the proposed STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g. CycleMLP-B1 and CycleMLP-B2, STD can still harvest about 1.1% and 0.5% accuracy gains, respectively.

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-li22c,
  title = 	 {Spatial-Channel Token Distillation for Vision {MLP}s},
  author =       {Li, Yanxi and Chen, Xinghao and Dong, Minjing and Tang, Yehui and Wang, Yunhe and Xu, Chang},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {12685--12695},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/li22c/li22c.pdf},
  url = 	 {https://proceedings.mlr.press/v162/li22c.html},
  abstract = 	 {Recently, neural architectures with all Multi-layer Perceptrons (MLPs) have attracted great research interest from the computer vision community. However, the inefficient mixing of spatial-channel information causes MLP-like vision models to demand tremendous pre-training on large-scale datasets. This work solves the problem from a novel knowledge distillation perspective. We propose a novel Spatial-channel Token Distillation (STD) method, which improves the information mixing in the two dimensions by introducing distillation tokens to each of them. A mutual information regularization is further introduced to let distillation tokens focus on their specific dimensions and maximize the performance gain. Extensive experiments on ImageNet for several MLP-like architectures demonstrate that the proposed token distillation mechanism can efficiently improve the accuracy. For example, the proposed STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g. CycleMLP-B1 and CycleMLP-B2, STD can still harvest about 1.1% and 0.5% accuracy gains, respectively.}
}

Endnote

%0 Conference Paper
%T Spatial-Channel Token Distillation for Vision MLPs
%A Yanxi Li
%A Xinghao Chen
%A Minjing Dong
%A Yehui Tang
%A Yunhe Wang
%A Chang Xu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-li22c
%I PMLR
%P 12685--12695
%U https://proceedings.mlr.press/v162/li22c.html
%V 162
%X Recently, neural architectures with all Multi-layer Perceptrons (MLPs) have attracted great research interest from the computer vision community. However, the inefficient mixing of spatial-channel information causes MLP-like vision models to demand tremendous pre-training on large-scale datasets. This work solves the problem from a novel knowledge distillation perspective. We propose a novel Spatial-channel Token Distillation (STD) method, which improves the information mixing in the two dimensions by introducing distillation tokens to each of them. A mutual information regularization is further introduced to let distillation tokens focus on their specific dimensions and maximize the performance gain. Extensive experiments on ImageNet for several MLP-like architectures demonstrate that the proposed token distillation mechanism can efficiently improve the accuracy. For example, the proposed STD boosts the top-1 accuracy of Mixer-S16 on ImageNet from 73.8% to 75.7% without any costly pre-training on JFT-300M. When applied to stronger architectures, e.g. CycleMLP-B1 and CycleMLP-B2, STD can still harvest about 1.1% and 0.5% accuracy gains, respectively.

APA


Li, Y., Chen, X., Dong, M., Tang, Y., Wang, Y., & Xu, C. (2022). Spatial-Channel Token Distillation for Vision MLPs. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:12685-12695. Available from https://proceedings.mlr.press/v162/li22c.html.
