Training Diffusion-based Generative Models with Limited Data

Zhaoyu Zhang, Yang Hua, Guanxiong Sun, Hui Wang, Seán Mcloone
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74950-74965, 2025.

Abstract

Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, or even infeasible. In this paper, we present a novel theoretical insight for diffusion models: two factors, namely the denoiser function hypothesis space and the number of training samples, affect the denoising score matching error over all training samples. Based on this insight, it is evident that minimizing the total denoising score matching error is challenging within the denoiser function hypothesis space used by existing methods when training diffusion models with limited data. To address this, we propose a new diffusion model called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model constrains the complexity of the denoiser function hypothesis space, and MAFP effectively increases the number of training samples by providing more informative guidance than existing data augmentation methods in the compressed hypothesis space. Extensive experiments on several datasets demonstrate that LD-Diffusion achieves better performance than other diffusion models. Code is available at https://github.com/zzhang05/LD-Diffusion.
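
To make the denoising score matching objective referenced in the abstract concrete, the following is a minimal sketch of a standard DSM training loss combined with a data augmentation applied at a fixed probability. This is not the authors' implementation: the denoiser network, the noise level sigma, the probability p_aug, and the augmentation pool below are illustrative assumptions, and the exact formulations of LD-Diffusion's compressing model and MAFP strategy are given in the paper itself.

    # Minimal sketch (assumed, not the paper's code): denoising score matching
    # with an augmentation applied at a fixed probability before adding noise.
    import torch

    def fixed_prob_augment(x, p_aug=0.5):
        """With fixed probability p_aug, apply one randomly chosen augmentation to x."""
        if torch.rand(()) < p_aug:
            aug_ops = [
                lambda v: torch.flip(v, dims=[-1]),            # horizontal flip
                lambda v: torch.rot90(v, k=1, dims=[-2, -1]),  # 90-degree rotation
            ]
            op = aug_ops[int(torch.randint(len(aug_ops), ()).item())]
            return op(x)
        return x

    def dsm_loss(denoiser, x0, sigma):
        """Denoising score matching: train denoiser(x0 + sigma*eps, sigma) to recover x0."""
        eps = torch.randn_like(x0)
        x_noisy = x0 + sigma * eps
        return ((denoiser(x_noisy, sigma) - x0) ** 2).mean()

    # Example usage with a placeholder denoiser that ignores sigma:
    # denoiser = lambda x, sigma: x
    # x0 = fixed_prob_augment(torch.randn(8, 3, 32, 32))
    # loss = dsm_loss(denoiser, x0, sigma=0.5)

In this sketch the augmentation simply replaces the clean sample before noising; how MAFP mixes augmentations and interacts with the compressed hypothesis space is specified in the paper.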

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhang25z,
  title     = {Training Diffusion-based Generative Models with Limited Data},
  author    = {Zhang, Zhaoyu and Hua, Yang and Sun, Guanxiong and Wang, Hui and Mcloone, Se\'{a}n},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {74950--74965},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25z/zhang25z.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25z.html},
  abstract  = {Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, and even infeasible. In this paper, we present a novel theoretical insight for diffusion models that two factors, i.e., the denoiser function hypothesis space and the number of training samples, can affect the denoising score matching error of all training samples. Based on this theoretical insight, it is evident that minimizing the total denoising score matching error is challenging within the denoiser function hypothesis space in existing methods, when training diffusion models with limited data. To address this, we propose a new diffusion model called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model can constrain the complexity of the denoiser function hypothesis space and MAFP can effectively increase the training samples by providing more informative guidance than existing data augmentation methods in the compressed hypothesis space. Extensive experiments on several datasets demonstrate that LD-Diffusion can achieve better performance compared to other diffusion models. Codes are available at https://github.com/zzhang05/LD-Diffusion.}
}
Endnote
%0 Conference Paper
%T Training Diffusion-based Generative Models with Limited Data
%A Zhaoyu Zhang
%A Yang Hua
%A Guanxiong Sun
%A Hui Wang
%A Seán Mcloone
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25z
%I PMLR
%P 74950--74965
%U https://proceedings.mlr.press/v267/zhang25z.html
%V 267
%X Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, and even infeasible. In this paper, we present a novel theoretical insight for diffusion models that two factors, i.e., the denoiser function hypothesis space and the number of training samples, can affect the denoising score matching error of all training samples. Based on this theoretical insight, it is evident that minimizing the total denoising score matching error is challenging within the denoiser function hypothesis space in existing methods, when training diffusion models with limited data. To address this, we propose a new diffusion model called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model can constrain the complexity of the denoiser function hypothesis space and MAFP can effectively increase the training samples by providing more informative guidance than existing data augmentation methods in the compressed hypothesis space. Extensive experiments on several datasets demonstrate that LD-Diffusion can achieve better performance compared to other diffusion models. Codes are available at https://github.com/zzhang05/LD-Diffusion.
APA
Zhang, Z., Hua, Y., Sun, G., Wang, H. & Mcloone, S. (2025). Training Diffusion-based Generative Models with Limited Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:74950-74965. Available from https://proceedings.mlr.press/v267/zhang25z.html.