D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples

Zijing Hu, Fengda Zhang, Kun Kuang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:24869-24892, 2025.

Abstract

The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
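To make the two ingredients of the abstract concrete, the sketch below illustrates (1) a mask-guided blending of self-attention features, where features from a well-aligned reference trajectory are injected only inside a mask so the result stays visually consistent with the poorly-aligned image elsewhere, and (2) a Diffusion-DPO-style preference loss over noise-prediction errors on the resulting pair. All function names, tensor shapes, the blending rule, and the loss hyperparameter are illustrative assumptions for exposition; this is not the authors' implementation of D-Fusion.

```python
# Hypothetical sketch only -- not the paper's implementation of D-Fusion.
import torch
import torch.nn.functional as F


def fuse_self_attention(feat_target: torch.Tensor,
                        feat_reference: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Blend two self-attention feature maps from the same UNet layer and step.

    feat_target, feat_reference: (batch, tokens, dim) features from the
        poorly-aligned and well-aligned denoising trajectories, respectively.
    mask: (batch, tokens, 1) in {0, 1}; 1 marks regions where the well-aligned
        reference features are injected, 0 keeps the target's own features.
    """
    return mask * feat_reference + (1.0 - mask) * feat_target


def diffusion_dpo_loss(err_theta_w: torch.Tensor, err_ref_w: torch.Tensor,
                       err_theta_l: torch.Tensor, err_ref_l: torch.Tensor,
                       beta: float = 5000.0) -> torch.Tensor:
    """Diffusion-DPO-style preference loss on per-sample denoising errors.

    err_*: mean-squared noise-prediction errors of the fine-tuned model (theta)
    and the frozen reference model (ref) on the preferred (w) and dispreferred
    (l) latents at the same timestep.
    """
    margin = (err_theta_w - err_ref_w) - (err_theta_l - err_ref_l)
    return -F.logsigmoid(-beta * margin).mean()


if __name__ == "__main__":
    # Toy shapes: 2 samples, 64 spatial tokens, 320-dim features.
    feat_t, feat_r = torch.randn(2, 64, 320), torch.randn(2, 64, 320)
    mask = (torch.rand(2, 64, 1) > 0.5).float()
    fused = fuse_self_attention(feat_t, feat_r, mask)
    loss = diffusion_dpo_loss(torch.rand(2), torch.rand(2),
                              torch.rand(2), torch.rand(2))
    print(fused.shape, loss.item())
```

Under this reading, the point about retaining denoising trajectories is what makes the pairs trainable: because the fused (preferred) image shares a sampling trajectory with the poorly-aligned (dispreferred) image, the per-timestep noise-prediction errors required by a loss of this form can be evaluated on matching latents.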

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hu25ab,
  title     = {D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples},
  author    = {Hu, Zijing and Zhang, Fengda and Kuang, Kun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {24869--24892},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hu25ab/hu25ab.pdf},
  url       = {https://proceedings.mlr.press/v267/hu25ab.html},
  abstract  = {The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.}
}
Endnote
%0 Conference Paper
%T D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples
%A Zijing Hu
%A Fengda Zhang
%A Kun Kuang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hu25ab
%I PMLR
%P 24869--24892
%U https://proceedings.mlr.press/v267/hu25ab.html
%V 267
%X The practical applications of diffusion models have been limited by the misalignment between generated images and corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
APA
Hu, Z., Zhang, F. & Kuang, K. (2025). D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:24869-24892. Available from https://proceedings.mlr.press/v267/hu25ab.html.
