Score as Action: Fine Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Hanyang Zhao, Haoxian Chen, Ji Zhang, David Yao, Wenpin Tang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:77371-77389, 2025.

Abstract

Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with input prompts, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced errors and often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat the score as the control or action, thereby connecting to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL and illustrate its potential for enhancing the value network design space by leveraging the structural properties of diffusion models. We validate the advantages of our method through experiments on the downstream task of fine-tuning a large-scale Text2Image model, Stable Diffusion v1.5.
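To make the stochastic-control view in the abstract concrete, here is a minimal sketch of the kind of formulation it describes, in generic notation: the drift f, diffusion coefficient g, pretrained score s_pre, fine-tuned score network s_theta, terminal reward r, and regularization weight beta below are placeholders, not necessarily the paper's exact symbols.

% Sketch only: score-as-action control of the reverse-time generative SDE (generic notation).
% The action a_t plays the role of the score; fine-tuning trades off a terminal reward r(X_T)
% against a running penalty that keeps the controlled score close to the pretrained one
% (for SDEs sharing the diffusion coefficient g, this quadratic term is the path-wise KL penalty).
\begin{align*}
  \mathrm{d}X_t &= \bigl[f(X_t,t) + g(t)^2\, a_t\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}W_t,
      \qquad a_t = s_\theta(X_t,t),\\
  \max_{\theta}\;& \mathbb{E}\bigl[r(X_T)\bigr]
      \;-\; \frac{\beta}{2}\,\mathbb{E}\!\int_0^T g(t)^2\,
      \bigl\|s_\theta(X_t,t) - s_{\mathrm{pre}}(X_t,t)\bigr\|^2\,\mathrm{d}t.
\end{align*}

Treating a_t as the action turns the generative SDE into a controlled process, so continuous-time policy optimization machinery (value functions over the state-time pair, KL/entropy-style regularization toward the pretrained score) applies directly, which is the connection the abstract refers to.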

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhao25f,
  title     = {Score as Action: Fine Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning},
  author    = {Zhao, Hanyang and Chen, Haoxian and Zhang, Ji and Yao, David and Tang, Wenpin},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {77371--77389},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhao25f/zhao25f.pdf},
  url       = {https://proceedings.mlr.press/v267/zhao25f.html}
}
APA
Zhao, H., Chen, H., Zhang, J., Yao, D. & Tang, W. (2025). Score as Action: Fine Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:77371-77389. Available from https://proceedings.mlr.press/v267/zhao25f.html.