Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled UNet with Forgetting Mechanism and Unified Vision-Language Optimization

Shiwen Zhang
Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, PMLR 285:232-243, 2024.

Abstract

Text-guided image editing on real or synthetic images, given only the original image and the target text prompt as inputs, is a very general and challenging task. It requires the editing model to estimate on its own which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of the original image. Imagic, the previous SOTA solution for text-guided image editing, suffers from slow optimization and is prone to overfitting, since only a single image is given. In this paper, we design a novel text-guided image editing method, Fast Imagic. First, we propose a vision-language joint optimization framework that quickly aligns the text embedding and the UNet with the given image; it can understand and reconstruct the original image in 30 seconds, much faster and with far less overfitting than the previous SOTA, Imagic. Second, we propose a novel vector projection mechanism in the text embedding space of Diffusion Models, which decomposes identity similarity and editing strength so that they can be controlled separately. Finally, we discover a general disentanglement property of the UNet in Diffusion Models: the UNet encoder learns space and structure, while the UNet decoder learns appearance and texture. Exploiting this property, we design a forgetting mechanism that merges the original and optimized checkpoints, successfully tackling the otherwise inevitable overfitting that arises when fine-tuning Diffusion Models on a single image and thereby significantly boosting their editing capability. Our method, Fast Imagic, even though built on the outdated Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark TEdBench, surpassing previous SOTA methods such as Imagic with Imagen in terms of both CLIP score and LPIPS score.
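
The two core mechanisms named in the abstract, vector projection in the text embedding space and the forgetting merge of UNet checkpoints, can be sketched concretely. The following Python sketch is illustrative only and not the authors' implementation: the alpha/beta coefficients, the tensor names, and the diffusers-style parameter-name markers ("down_blocks", "mid_block") are assumptions.

    import torch

    def project_text_embedding(e_tgt, e_opt, alpha=1.0, beta=1.0):
        # Decompose the target prompt embedding e_tgt with respect to the optimized
        # embedding e_opt: the component parallel to e_opt carries identity similarity,
        # the orthogonal residual carries editing strength; recombine with separate weights.
        flat_tgt, flat_opt = e_tgt.flatten(), e_opt.flatten()
        scale = torch.dot(flat_tgt, flat_opt) / torch.dot(flat_opt, flat_opt)
        parallel = scale * e_opt          # identity-preserving component
        orthogonal = e_tgt - parallel     # editing-strength component
        return alpha * parallel + beta * orthogonal

    def forget_merge(orig_state, tuned_state, forget="encoder"):
        # Merge the original and fine-tuned UNet state dicts by resetting ("forgetting")
        # one half back to its original weights, exploiting the observation that the
        # encoder learns space/structure while the decoder learns appearance/texture.
        enc_markers = ("down_blocks", "mid_block")   # assumed diffusers-style key names
        merged = {}
        for name, w_orig in orig_state.items():
            is_encoder = any(m in name for m in enc_markers)
            keep_original = (forget == "encoder" and is_encoder) or \
                            (forget == "decoder" and not is_encoder)
            merged[name] = w_orig if keep_original else tuned_state[name]
        return merged

Under these assumptions, which half to forget can be chosen per edit, and the recombined embedding is fed to the merged UNet during sampling.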

Cite this Paper


BibTeX
@InProceedings{pmlr-v285-zhang24a,
  title     = {Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled {UN}et with Forgetting Mechanism and Unified Vision-Language Optimization},
  author    = {Zhang, Shiwen},
  booktitle = {Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models},
  pages     = {232--243},
  year      = {2024},
  editor    = {Fumero, Marco and Domine, Clementine and Lähner, Zorah and Crisostomi, Donato and Moschella, Luca and Stachenfeld, Kimberly},
  volume    = {285},
  series    = {Proceedings of Machine Learning Research},
  month     = {14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v285/main/assets/zhang24a/zhang24a.pdf},
  url       = {https://proceedings.mlr.press/v285/zhang24a.html}
}
Endnote
%0 Conference Paper
%T Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled UNet with Forgetting Mechanism and Unified Vision-Language Optimization
%A Shiwen Zhang
%B Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models
%C Proceedings of Machine Learning Research
%D 2024
%E Marco Fumero
%E Clementine Domine
%E Zorah Lähner
%E Donato Crisostomi
%E Luca Moschella
%E Kimberly Stachenfeld
%F pmlr-v285-zhang24a
%I PMLR
%P 232--243
%U https://proceedings.mlr.press/v285/zhang24a.html
%V 285
APA
Zhang, S. (2024). Fast Imagic: Solving Overfitting in Text-guided Image Editing via Disentangled UNet with Forgetting Mechanism and Unified Vision-Language Optimization. Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, in Proceedings of Machine Learning Research 285:232-243. Available from https://proceedings.mlr.press/v285/zhang24a.html.
