Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8821-8831, 2021.

Abstract

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
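To make the "single stream of data" idea concrete, the following is a minimal sketch (not the authors' released code) of a decoder-only transformer that embeds text token ids and image token ids in one shared vocabulary, concatenates them into a single sequence, and predicts the next token under a causal mask. The vocabulary sizes, sequence lengths, and model dimensions are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch, assuming text tokens and image tokens are already discretized
# (e.g. by a BPE tokenizer and an image codebook). All sizes below are
# illustrative assumptions; this is not the authors' implementation.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed text / image vocabulary sizes
TEXT_LEN, IMAGE_LEN = 256, 1024         # assumed token counts per example
D_MODEL, N_HEADS, N_LAYERS = 512, 8, 6  # small illustrative transformer

class TextImageTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared embedding table over the joint vocabulary:
        # ids in [0, TEXT_VOCAB) are text tokens, the rest are image tokens.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, TEXT_LEN + IMAGE_LEN), text ids followed by image ids.
        seq_len = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        # Causal (lower-triangular) mask keeps the model autoregressive over the
        # whole text-then-image stream.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=causal)
        return self.head(x)  # next-token logits over the joint vocabulary

# Toy usage: random text and image token ids concatenated into one stream.
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))
stream = torch.cat([text, image], dim=1)
logits = TextImageTransformer()(stream)  # (2, TEXT_LEN + IMAGE_LEN, joint vocab)

Because the image tokens follow the text tokens in the sequence, sampling image positions left to right conditions each image token on the full caption, which is what allows generation from a text prompt alone.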

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-ramesh21a,
  title     = {Zero-Shot Text-to-Image Generation},
  author    = {Ramesh, Aditya and Pavlov, Mikhail and Goh, Gabriel and Gray, Scott and Voss, Chelsea and Radford, Alec and Chen, Mark and Sutskever, Ilya},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {8821--8831},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/ramesh21a/ramesh21a.pdf},
  url       = {https://proceedings.mlr.press/v139/ramesh21a.html},
  abstract  = {Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.}
}
Endnote
%0 Conference Paper
%T Zero-Shot Text-to-Image Generation
%A Aditya Ramesh
%A Mikhail Pavlov
%A Gabriel Goh
%A Scott Gray
%A Chelsea Voss
%A Alec Radford
%A Mark Chen
%A Ilya Sutskever
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-ramesh21a
%I PMLR
%P 8821--8831
%U https://proceedings.mlr.press/v139/ramesh21a.html
%V 139
%X Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
APA
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M. & Sutskever, I. (2021). Zero-Shot Text-to-Image Generation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:8821-8831. Available from https://proceedings.mlr.press/v139/ramesh21a.html.