Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser; Sumith Kulal; Andreas Blattmann; Rahim Entezari; Jonas Müller; Harry Saini; Yam Levi; Dominik Lorenz; Axel Sauer; Frederic Boesel; Dustin Podell; Tim Dockhorn; Zion English; Robin Rombach

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:12606-12633, 2024.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models. Stability AI is considering making experimental data, code, and model weights publicly available.

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-esser24a,
  title = 	 {Scaling Rectified Flow Transformers for High-Resolution Image Synthesis},
  author =       {Esser, Patrick and Kulal, Sumith and Blattmann, Andreas and Entezari, Rahim and M\"{u}ller, Jonas and Saini, Harry and Levi, Yam and Lorenz, Dominik and Sauer, Axel and Boesel, Frederic and Podell, Dustin and Dockhorn, Tim and English, Zion and Rombach, Robin},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {12606--12633},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/esser24a/esser24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/esser24a.html},
  abstract = 	 {Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models. Stability AI is considering making experimental data, code, and model weights publicly available.}
}

Endnote

%0 Conference Paper
%T Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
%A Patrick Esser
%A Sumith Kulal
%A Andreas Blattmann
%A Rahim Entezari
%A Jonas Müller
%A Harry Saini
%A Yam Levi
%A Dominik Lorenz
%A Axel Sauer
%A Frederic Boesel
%A Dustin Podell
%A Tim Dockhorn
%A Zion English
%A Robin Rombach
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-esser24a
%I PMLR
%P 12606--12633
%U https://proceedings.mlr.press/v235/esser24a.html
%V 235
%X Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models. Stability AI is considering making experimental data, code, and model weights publicly available.

APA

Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., Podell, D., Dockhorn, T., English, Z., & Rombach, R. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:12606-12633. Available from https://proceedings.mlr.press/v235/esser24a.html.
