Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:20149-20167, 2023.

Abstract

Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
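
To make the pretext task concrete, the sketch below illustrates the generic mask-and-reconstruct recipe the abstract describes, using a small CNN backbone in PyTorch. It is a minimal illustration, not the authors' A$^2$MIM implementation: the 16-pixel patch size, 60% mask ratio, toy encoder, and masked-pixel L1 loss are all illustrative assumptions.

import torch
import torch.nn as nn

def random_patch_mask(images, patch_size=16, mask_ratio=0.6):
    """Zero out a random subset of non-overlapping patches.
    Patch size and mask ratio are illustrative, not the paper's settings."""
    B, C, H, W = images.shape
    gh, gw = H // patch_size, W // patch_size
    num_patches = gh * gw
    num_masked = int(mask_ratio * num_patches)
    # Choose num_masked patch indices per image uniformly at random.
    ids = torch.rand(B, num_patches).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, num_patches, dtype=torch.bool)
    mask.scatter_(1, ids, True)
    # Upsample the patch-level mask to pixel resolution.
    pixel_mask = mask.view(B, 1, gh, gw).float()
    pixel_mask = pixel_mask.repeat_interleave(patch_size, dim=2)
    pixel_mask = pixel_mask.repeat_interleave(patch_size, dim=3)
    return images * (1.0 - pixel_mask), pixel_mask

class TinyMIM(nn.Module):
    """Toy CNN encoder with a 1x1-conv reconstruction head. The backbone
    is deliberately generic: any feature extractor (ViT or CNN) could sit
    here, which is the architecture-agnostic point of the paper."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 3, 1)  # predict RGB values per pixel

    def forward(self, x):
        return self.head(self.encoder(x))

# One pre-training step: L1 reconstruction loss on the masked pixels only
# (a common MIM choice, assumed here for illustration).
model = TinyMIM()
images = torch.randn(4, 3, 224, 224)  # stand-in for a real image batch
masked_images, pixel_mask = random_patch_mask(images)
recon = model(masked_images)
loss = ((recon - images).abs() * pixel_mask).sum() / (pixel_mask.sum() * 3)
loss.backward()

A$^2$MIM layers further design choices on top of this generic recipe; the sketch captures only the mask-and-reconstruct idea shared by MIM methods.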

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-li23af,
  title     = {Architecture-Agnostic Masked Image Modeling -- From {V}i{T} back to {CNN}},
  author    = {Li, Siyuan and Wu, Di and Wu, Fang and Zang, Zelin and Li, Stan Z.},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {20149--20167},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/li23af/li23af.pdf},
  url       = {https://proceedings.mlr.press/v202/li23af.html},
  abstract  = {Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.}
}
Endnote
%0 Conference Paper
%T Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN
%A Siyuan Li
%A Di Wu
%A Fang Wu
%A Zelin Zang
%A Stan Z. Li
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-li23af
%I PMLR
%P 20149--20167
%U https://proceedings.mlr.press/v202/li23af.html
%V 202
%X Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
APA
Li, S., Wu, D., Wu, F., Zang, Z. & Li, S. Z. (2023). Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:20149-20167. Available from https://proceedings.mlr.press/v202/li23af.html.
