Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang; Han Zhang; Jarred Barber; Aaron Maschinot; Jose Lezama; Lu Jiang; Ming-Hsuan Yang; Kevin Patrick Murphy; William T. Freeman; Michael Rubinstein; Yuanzhen Li; Dilip Krishnan

Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:4055-4075, 2023.

Abstract

We present Muse, a text-to-image Transformermodel that achieves state-of-the-art image genera-tion performance while being significantly moreefficient than diffusion or autoregressive models.Muse is trained on a masked modeling task indiscrete token space: given the text embeddingextracted from a pre-trained large language model(LLM), Muse learns to predict randomly maskedimage tokens. Compared to pixel-space diffusionmodels, such as Imagen and DALL-E 2, Muse issignificantly more efficient due to the use of dis-crete tokens and requires fewer sampling itera-tions; compared to autoregressive models such asParti, Muse is more efficient due to the use of par-allel decoding. The use of a pre-trained LLM en-ables fine-grained language understanding, whichtranslates to high-fidelity image generation andthe understanding of visual concepts such as ob-jects, their spatial relationships, pose, cardinalityetc. Our 900M parameter model achieves a newSOTA on CC3M, with an FID score of 6.06. TheMuse 3B parameter model achieves an FID of7.88 on zero-shot COCO evaluation, along with aCLIP score of 0.32. Muse also directly enables anumber of image editing applications without theneed to fine-tune or invert the model: inpainting,outpainting, and mask-free editing. More resultsand videos demonstrating editing are available at https://muse-icml.github.io/

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-chang23b,
  title = 	 {Muse: Text-To-Image Generation via Masked Generative Transformers},
  author =       {Chang, Huiwen and Zhang, Han and Barber, Jarred and Maschinot, Aaron and Lezama, Jose and Jiang, Lu and Yang, Ming-Hsuan and Murphy, Kevin Patrick and Freeman, William T. and Rubinstein, Michael and Li, Yuanzhen and Krishnan, Dilip},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {4055--4075},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/chang23b/chang23b.pdf},
  url = 	 {https://proceedings.mlr.press/v202/chang23b.html},
  abstract = 	 {We present Muse, a text-to-image Transformermodel that achieves state-of-the-art image genera-tion performance while being significantly moreefficient than diffusion or autoregressive models.Muse is trained on a masked modeling task indiscrete token space: given the text embeddingextracted from a pre-trained large language model(LLM), Muse learns to predict randomly maskedimage tokens. Compared to pixel-space diffusionmodels, such as Imagen and DALL-E 2, Muse issignificantly more efficient due to the use of dis-crete tokens and requires fewer sampling itera-tions; compared to autoregressive models such asParti, Muse is more efficient due to the use of par-allel decoding. The use of a pre-trained LLM en-ables fine-grained language understanding, whichtranslates to high-fidelity image generation andthe understanding of visual concepts such as ob-jects, their spatial relationships, pose, cardinalityetc. Our 900M parameter model achieves a newSOTA on CC3M, with an FID score of 6.06. TheMuse 3B parameter model achieves an FID of7.88 on zero-shot COCO evaluation, along with aCLIP score of 0.32. Muse also directly enables anumber of image editing applications without theneed to fine-tune or invert the model: inpainting,outpainting, and mask-free editing. More resultsand videos demonstrating editing are available at https://muse-icml.github.io/}
}

Endnote

%0 Conference Paper
%T Muse: Text-To-Image Generation via Masked Generative Transformers
%A Huiwen Chang
%A Han Zhang
%A Jarred Barber
%A Aaron Maschinot
%A Jose Lezama
%A Lu Jiang
%A Ming-Hsuan Yang
%A Kevin Patrick Murphy
%A William T. Freeman
%A Michael Rubinstein
%A Yuanzhen Li
%A Dilip Krishnan
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett	
%F pmlr-v202-chang23b
%I PMLR
%P 4055--4075
%U https://proceedings.mlr.press/v202/chang23b.html
%V 202
%X We present Muse, a text-to-image Transformermodel that achieves state-of-the-art image genera-tion performance while being significantly moreefficient than diffusion or autoregressive models.Muse is trained on a masked modeling task indiscrete token space: given the text embeddingextracted from a pre-trained large language model(LLM), Muse learns to predict randomly maskedimage tokens. Compared to pixel-space diffusionmodels, such as Imagen and DALL-E 2, Muse issignificantly more efficient due to the use of dis-crete tokens and requires fewer sampling itera-tions; compared to autoregressive models such asParti, Muse is more efficient due to the use of par-allel decoding. The use of a pre-trained LLM en-ables fine-grained language understanding, whichtranslates to high-fidelity image generation andthe understanding of visual concepts such as ob-jects, their spatial relationships, pose, cardinalityetc. Our 900M parameter model achieves a newSOTA on CC3M, with an FID score of 6.06. TheMuse 3B parameter model achieves an FID of7.88 on zero-shot COCO evaluation, along with aCLIP score of 0.32. Muse also directly enables anumber of image editing applications without theneed to fine-tune or invert the model: inpainting,outpainting, and mask-free editing. More resultsand videos demonstrating editing are available at https://muse-icml.github.io/

APA


Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M., Murphy, K.P., Freeman, W.T., Rubinstein, M., Li, Y. & Krishnan, D.. (2023). Muse: Text-To-Image Generation via Masked Generative Transformers. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:4055-4075 Available from https://proceedings.mlr.press/v202/chang23b.html.

Muse: Text-To-Image Generation via Masked Generative Transformers

Abstract

Cite this Paper

Related Material