CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation

Minghao Fu, Guo-Hua Wang, Liangfu Cao, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:17852-17874, 2025.

Abstract

Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, when applied independently, as in current text-to-image models, they still struggle to achieve strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. To this end, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to exploit the useful information contained in both. We observe that CHATS is exceptionally data-efficient, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference-alignment methods, setting a new state of the art across various standard benchmarks. The code is publicly available at github.com/AIDC-AI/CHATS.
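As a rough illustration of the idea of combining two learned distributions at test time, the sketch below shows a classifier-free-guidance-style denoising step that steers generation toward a preferred-distribution denoiser and away from a dispreferred-distribution denoiser queried with a proxy prompt. This is a minimal sketch based only on the abstract; the names (eps_preferred, eps_dispreferred, guidance_scale), the dummy denoisers, and the exact update rule are illustrative assumptions, not the paper's actual formulation (see the linked PDF and repository for that).

# Hypothetical sketch of a CFG-style sampling step combining a
# preferred-distribution denoiser and a dispreferred-distribution
# denoiser, in the spirit of CHATS. All names and the update rule are
# illustrative assumptions, not the paper's code.
import numpy as np

def eps_preferred(x_t, prompt_emb):
    # Stand-in for the preferred-distribution model's noise prediction.
    return 0.9 * x_t

def eps_dispreferred(x_t, proxy_prompt_emb):
    # Stand-in for the dispreferred-distribution model's noise prediction.
    return 1.1 * x_t

def guided_noise(x_t, prompt_emb, proxy_prompt_emb, guidance_scale=5.0):
    # Steer toward the preferred prediction and away from the dispreferred
    # one, analogous to classifier-free guidance with the unconditional
    # branch replaced by the dispreferred model on a proxy prompt.
    e_pref = eps_preferred(x_t, prompt_emb)
    e_disp = eps_dispreferred(x_t, proxy_prompt_emb)
    return e_disp + guidance_scale * (e_pref - e_disp)

x_t = np.random.randn(4, 64, 64)   # latent at the current timestep
prompt_emb = np.zeros(768)         # placeholder text embedding
proxy_emb = np.zeros(768)          # placeholder proxy-prompt embedding
print(guided_noise(x_t, prompt_emb, proxy_emb).shape)  # (4, 64, 64)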

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-fu25g,
  title     = {{CHATS}: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation},
  author    = {Fu, Minghao and Wang, Guo-Hua and Cao, Liangfu and Chen, Qing-Guo and Xu, Zhao and Luo, Weihua and Zhang, Kaifu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {17852--17874},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/fu25g/fu25g.pdf},
  url       = {https://proceedings.mlr.press/v267/fu25g.html},
  abstract  = {Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, when applied independently, as in current text-to-image models, they still struggle to achieve strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. To this end, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to exploit the useful information contained in both. We observe that CHATS is exceptionally data-efficient, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference-alignment methods, setting a new state of the art across various standard benchmarks. The code is publicly available at github.com/AIDC-AI/CHATS.}
}
Endnote
%0 Conference Paper
%T CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation
%A Minghao Fu
%A Guo-Hua Wang
%A Liangfu Cao
%A Qing-Guo Chen
%A Zhao Xu
%A Weihua Luo
%A Kaifu Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-fu25g
%I PMLR
%P 17852--17874
%U https://proceedings.mlr.press/v267/fu25g.html
%V 267
%X Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, when applied independently, as in current text-to-image models, they still struggle to achieve strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. To this end, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to exploit the useful information contained in both. We observe that CHATS is exceptionally data-efficient, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference-alignment methods, setting a new state of the art across various standard benchmarks. The code is publicly available at github.com/AIDC-AI/CHATS.
APA
Fu, M., Wang, G., Cao, L., Chen, Q., Xu, Z., Luo, W. & Zhang, K. (2025). CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:17852-17874. Available from https://proceedings.mlr.press/v267/fu25g.html.
