A Dense Reward View on Aligning Text-to-Image Diffusion with Preference

Shentao Yang, Tianqi Chen, Mingyuan Zhou
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:55998-56032, 2024.

Abstract

Aligning text-to-image diffusion models (T2I) with preference has been gaining increasing research attention. Although prior works directly optimize T2I models on preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, ignoring the sequential nature of the generation process. This may harm both the efficacy and the efficiency of preference alignment. In this paper, we take a finer dense-reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single- and multiple-prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations illustrate the insight behind our approach. Source code is available at https://github.com/Shentao-YANG/Dense_Reward_T2I .
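The abstract describes the objective only at a high level. As a rough illustration, not the paper's exact formulation, the sketch below shows one way a temporal discount could be folded into a DPO-style, reference-regularized preference loss over the per-step reverse transitions so that early denoising steps carry more weight. All names here (discounted_dpo_loss, gamma, beta, and the per-step log-probability tensors) are illustrative assumptions rather than the authors' implementation.

# Hypothetical PyTorch sketch: a DPO-style preference loss over a diffusion
# reverse chain with temporal discounting that up-weights the initial steps.
# This is an assumption-laden illustration, not the paper's exact objective.
import torch
import torch.nn.functional as F

def discounted_dpo_loss(logp_win, logp_lose, logp_ref_win, logp_ref_lose,
                        gamma: float = 0.99, beta: float = 1.0) -> torch.Tensor:
    """Preference loss on two reverse chains ("win" preferred over "lose").

    Each logp_* tensor has shape (batch, T): per-step log-probabilities of the
    reverse transitions under the trained model (logp_win, logp_lose) and under
    a frozen reference model (logp_ref_win, logp_ref_lose).
    """
    T = logp_win.shape[1]
    # Discount weights gamma**t for reverse steps t = 0..T-1, so the initial
    # (high-noise) steps of the chain dominate the objective.
    weights = gamma ** torch.arange(T, dtype=logp_win.dtype, device=logp_win.device)

    # Per-step log-ratios against the reference model, discounted and summed
    # over the chain to form an implicit reward for each sample.
    reward_win = ((logp_win - logp_ref_win) * weights).sum(dim=1)
    reward_lose = ((logp_lose - logp_ref_lose) * weights).sum(dim=1)

    # Bradley-Terry / DPO logistic loss on the implicit reward margin.
    return -F.logsigmoid(beta * (reward_win - reward_lose)).mean()

With gamma = 1 the weights are uniform and the loss reduces to a temporally symmetric, whole-chain DPO-style objective; gamma < 1 breaks that symmetry in favor of the early reverse steps, which is the intuition the abstract points to.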

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-yang24e,
  title     = {A Dense Reward View on Aligning Text-to-Image Diffusion with Preference},
  author    = {Yang, Shentao and Chen, Tianqi and Zhou, Mingyuan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {55998--56032},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/yang24e/yang24e.pdf},
  url       = {https://proceedings.mlr.press/v235/yang24e.html},
  abstract  = {Aligning text-to-image diffusion model (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I by preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, while ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take on a finer dense reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach. Source code is available at https://github.com/Shentao-YANG/Dense_Reward_T2I .}
}
Endnote
%0 Conference Paper
%T A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
%A Shentao Yang
%A Tianqi Chen
%A Mingyuan Zhou
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-yang24e
%I PMLR
%P 55998--56032
%U https://proceedings.mlr.press/v235/yang24e.html
%V 235
%X Aligning text-to-image diffusion model (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I by preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, while ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take on a finer dense reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach. Source code is available at https://github.com/Shentao-YANG/Dense_Reward_T2I .
APA
Yang, S., Chen, T. & Zhou, M. (2024). A Dense Reward View on Aligning Text-to-Image Diffusion with Preference. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:55998-56032. Available from https://proceedings.mlr.press/v235/yang24e.html.