Rejuvenating image-GPT as Strong Visual Representation Learners

Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42449-42461, 2024.

Abstract

This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement is its compelling performance on the ImageNet-1K dataset — by training on publicly available datasets, D-iGPT unprecedentedly achieves 90.0% top-1 accuracy with a vanilla ViT-H. Additionally, D-iGPT shows strong generalization on the downstream task. Code is available at https://github.com/OliverRensu/D-iGPT.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-ren24d,
  title = {Rejuvenating image-{GPT} as Strong Visual Representation Learners},
  author = {Ren, Sucheng and Wang, Zeyu and Zhu, Hongru and Xiao, Junfei and Yuille, Alan and Xie, Cihang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {42449--42461},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/ren24d/ren24d.pdf},
  url = {https://proceedings.mlr.press/v235/ren24d.html},
  abstract = {This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement is its compelling performance on the ImageNet-1K dataset — by training on publicly available datasets, D-iGPT unprecedentedly achieves 90.0% top-1 accuracy with a vanilla ViT-H. Additionally, D-iGPT shows strong generalization on the downstream task. Code is available at https://github.com/OliverRensu/D-iGPT.}
}
Endnote
%0 Conference Paper
%T Rejuvenating image-GPT as Strong Visual Representation Learners
%A Sucheng Ren
%A Zeyu Wang
%A Hongru Zhu
%A Junfei Xiao
%A Alan Yuille
%A Cihang Xie
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-ren24d
%I PMLR
%P 42449--42461
%U https://proceedings.mlr.press/v235/ren24d.html
%V 235
%X This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations: A notable achievement is its compelling performance on the ImageNet-1K dataset — by training on publicly available datasets, D-iGPT unprecedentedly achieves 90.0% top-1 accuracy with a vanilla ViT-H. Additionally, D-iGPT shows strong generalization on the downstream task. Code is available at https://github.com/OliverRensu/D-iGPT.
APA
Ren, S., Wang, Z., Zhu, H., Xiao, J., Yuille, A. & Xie, C. (2024). Rejuvenating image-GPT as Strong Visual Representation Learners. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:42449-42461. Available from https://proceedings.mlr.press/v235/ren24d.html.
