Generative Pretraining From Pixels

Mark Chen; Alec Radford; Rewon Child; Jeffrey Wu; Heewoo Jun; David Luan; Ilya Sutskever

Generative Pretraining From Pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, Ilya Sutskever

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1691-1703, 2020.

Abstract

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-chen20s,
  title = 	 {Generative Pretraining From Pixels},
  author =       {Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {1691--1703},
  year = 	 {2020},
  editor = 	 {III, Hal Daumé and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/chen20s/chen20s.pdf},
  url = 	 {https://proceedings.mlr.press/v119/chen20s.html},
  abstract = 	 {Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.}
}

Endnote

%0 Conference Paper
%T Generative Pretraining From Pixels
%A Mark Chen
%A Alec Radford
%A Rewon Child
%A Jeffrey Wu
%A Heewoo Jun
%A David Luan
%A Ilya Sutskever
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh	
%F pmlr-v119-chen20s
%I PMLR
%P 1691--1703
%U https://proceedings.mlr.press/v119/chen20s.html
%V 119
%X Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.

APA

Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D. & Sutskever, I.. (2020). Generative Pretraining From Pixels. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:1691-1703 Available from https://proceedings.mlr.press/v119/chen20s.html.

Generative Pretraining From Pixels

Abstract

Cite this Paper

Related Material