Towards Theoretical Understandings of Self-Consuming Generative Models

Shi Fu, Sen Zhang, Yingjie Wang, Xinmei Tian, Dacheng Tao
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:14228-14255, 2024.

Abstract

This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models, including parametric and non-parametric models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios for diffusion models with a one-hidden-layer neural network score function. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
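To make the recursive setup concrete, the following minimal Python sketch simulates a self-consuming loop in the non-parametric (kernel density estimation) setting the abstract mentions. It is illustrative only and not the paper's construction: the real distribution is a toy 1-D Gaussian, the model is SciPy's gaussian_kde, the per-generation mixture sizes n_real and n_syn are arbitrary choices, and the total variation (TV) distance is approximated numerically on a grid.

# Minimal sketch of a self-consuming training loop (illustrative only; the
# toy Gaussian data, gaussian_kde model, and mixture sizes are assumptions,
# not the paper's setting).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

real_data = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in for the real distribution
n_real, n_syn = 500, 500                               # mixture sizes per generation
n_generations = 5

model = gaussian_kde(real_data)  # generation-0 model trained on real data only
for g in range(1, n_generations + 1):
    synthetic = model.resample(n_syn).ravel()                   # sample from the previous generation
    fresh_real = rng.choice(real_data, size=n_real, replace=False)
    mixed = np.concatenate([fresh_real, synthetic])             # mixed real + synthetic training set
    model = gaussian_kde(mixed)                                 # retrain on the mixture

    # crude grid estimate of the drift from the real distribution
    grid = np.linspace(-5.0, 5.0, 2001)
    true_pdf = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)
    tv = 0.5 * np.sum(np.abs(model(grid) - true_pdf)) * (grid[1] - grid[0])
    print(f"generation {g}: estimated TV distance ~ {tv:.3f}")

Varying n_syn relative to n_real in such a simulation is one informal way to visualize the dependence on mixture composition that the paper analyzes rigorously.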

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-fu24i,
  title     = {Towards Theoretical Understandings of Self-Consuming Generative Models},
  author    = {Fu, Shi and Zhang, Sen and Wang, Yingjie and Tian, Xinmei and Tao, Dacheng},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {14228--14255},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/fu24i/fu24i.pdf},
  url       = {https://proceedings.mlr.press/v235/fu24i.html}
}
Endnote
%0 Conference Paper
%T Towards Theoretical Understandings of Self-Consuming Generative Models
%A Shi Fu
%A Sen Zhang
%A Yingjie Wang
%A Xinmei Tian
%A Dacheng Tao
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-fu24i
%I PMLR
%P 14228--14255
%U https://proceedings.mlr.press/v235/fu24i.html
%V 235
APA
Fu, S., Zhang, S., Wang, Y., Tian, X., & Tao, D. (2024). Towards Theoretical Understandings of Self-Consuming Generative Models. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:14228-14255. Available from https://proceedings.mlr.press/v235/fu24i.html.