StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

Axel Sauer; Tero Karras; Samuli Laine; Andreas Geiger; Timo Aila

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, Timo Aila

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:30105-30118, 2023.

Abstract

Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-sauer23a,
  title = 	 {{S}tyle{GAN}-T: Unlocking the Power of {GAN}s for Fast Large-Scale Text-to-Image Synthesis},
  author =       {Sauer, Axel and Karras, Tero and Laine, Samuli and Geiger, Andreas and Aila, Timo},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {30105--30118},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/sauer23a/sauer23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/sauer23a.html},
  abstract = 	 {Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.}
}

Endnote

%0 Conference Paper
%T StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis
%A Axel Sauer
%A Tero Karras
%A Samuli Laine
%A Andreas Geiger
%A Timo Aila
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett	
%F pmlr-v202-sauer23a
%I PMLR
%P 30105--30118
%U https://proceedings.mlr.press/v202/sauer23a.html
%V 202
%X Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.

APA

Sauer, A., Karras, T., Laine, S., Geiger, A. & Aila, T.. (2023). StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:30105-30118 Available from https://proceedings.mlr.press/v202/sauer23a.html.

StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis

Abstract

Cite this Paper

Related Material