Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion

Xuantong Liu; Tianyang Hu; Wenjia Wang; Kenji Kawaguchi; Yuan Yao

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:31165-31185, 2024.

Abstract

As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench. The code is available at https://github.com/Pepper-lll/VLMinv.

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-liu24aa,
  title = {Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion},
  author = {Liu, Xuantong and Hu, Tianyang and Wang, Wenjia and Kawaguchi, Kenji and Yao, Yuan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages = {31165--31185},
  year = {2024},
  editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = {235},
  series = {Proceedings of Machine Learning Research},
  month = {21--27 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/liu24aa/liu24aa.pdf},
  url = {https://proceedings.mlr.press/v235/liu24aa.html},
  abstract = {As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench. The code is available at https://github.com/Pepper-lll/VLMinv.}
}

Endnote

%0 Conference Paper
%T Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion
%A Xuantong Liu
%A Tianyang Hu
%A Wenjia Wang
%A Kenji Kawaguchi
%A Yuan Yao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-liu24aa
%I PMLR
%P 31165--31185
%U https://proceedings.mlr.press/v235/liu24aa.html
%V 235
%X As a dominant force in text-to-image generation tasks, Diffusion Probabilistic Models (DPMs) face a critical challenge in controllability, struggling to adhere strictly to complex, multi-faceted instructions. In this work, we aim to address this alignment challenge for conditional generation tasks. First, we provide an alternative view of state-of-the-art DPMs as a way of inverting advanced Vision-Language Models (VLMs). With this formulation, we naturally propose a training-free approach that bypasses the conventional sampling process associated with DPMs. By directly optimizing images with the supervision of discriminative VLMs, the proposed method can potentially achieve a better text-image alignment. As proof of concept, we demonstrate the pipeline with the pre-trained BLIP-2 model and identify several key designs for improved image generation. To further enhance the image fidelity, a Score Distillation Sampling module of Stable Diffusion is incorporated. By carefully balancing the two components during optimization, our method can produce high-quality images with near state-of-the-art performance on T2I-Compbench. The code is available at https://github.com/Pepper-lll/VLMinv.

APA


Liu, X., Hu, T., Wang, W., Kawaguchi, K., & Yao, Y. (2024). Referee Can Play: An Alternative Approach to Conditional Generation via Model Inversion. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:31165-31185. Available from https://proceedings.mlr.press/v235/liu24aa.html.
