Generative Data Mining with Longtail-Guided Diffusion

David S Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric M Wolff, Siddhartha Srinivasa
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:22298-22317, 2025.

Abstract

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
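To make the abstract's guidance mechanism concrete, the following is a minimal, hypothetical sketch, not the paper's implementation. It assumes generic unet, vae, classifier, and scheduler callables in the style of a latent diffusion sampling loop (scheduler.step is shown with a diffusers-style signature), and it uses predictive entropy as a simple stand-in for the paper's single-forward-pass epistemic-uncertainty signal; the function names are invented for illustration. Consistent with the abstract, the predictive model only ever sees a one-step estimate of the fully denoised image, never a noisy intermediate diffusion state.

import torch

def entropy_signal(logits):
    # Predictive entropy: a simple stand-in for the paper's
    # differentiable epistemic-uncertainty signal.
    log_p = torch.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()

def longtail_guided_step(z_t, t, unet, vae, classifier, scheduler, scale=1.0):
    # One classifier-guidance-style denoising step that biases sampling
    # toward inputs the predictive model finds rare or hard.
    with torch.enable_grad():
        z_t = z_t.detach().requires_grad_(True)
        eps = unet(z_t, t)  # predicted noise (schematic interface)
        # DDIM-style one-step estimate of the clean latent, so the
        # predictive model is applied to a fully denoised input only.
        alpha_bar = scheduler.alphas_cumprod[t]
        z0_hat = (z_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()
        x0_hat = vae.decode(z0_hat)
        signal = entropy_signal(classifier(x0_hat))
        grad = torch.autograd.grad(signal, z_t)[0]
    # Ascend the longtail signal by shifting the predicted noise,
    # as in standard classifier guidance.
    eps_guided = eps.detach() - scale * (1 - alpha_bar).sqrt() * grad
    return scheduler.step(eps_guided, t, z_t.detach()).prev_sample

Note that neither the diffusion model nor the predictive model is retrained in this loop; guidance enters only through the gradient of the longtail signal with respect to the current latent.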

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hayden25a,
  title     = {Generative Data Mining with Longtail-Guided Diffusion},
  author    = {Hayden, David S and Ye, Mao and Garipov, Timur and Meyer, Gregory P. and Vondrick, Carl and Chen, Zhao and Chai, Yuning and Wolff, Eric M and Srinivasa, Siddhartha},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {22298--22317},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hayden25a/hayden25a.pdf},
  url       = {https://proceedings.mlr.press/v267/hayden25a.html},
  abstract  = {It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.}
}
Endnote
%0 Conference Paper
%T Generative Data Mining with Longtail-Guided Diffusion
%A David S Hayden
%A Mao Ye
%A Timur Garipov
%A Gregory P. Meyer
%A Carl Vondrick
%A Zhao Chen
%A Yuning Chai
%A Eric M Wolff
%A Siddhartha Srinivasa
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hayden25a
%I PMLR
%P 22298--22317
%U https://proceedings.mlr.press/v267/hayden25a.html
%V 267
%X It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a VLM to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
APA
Hayden, D. S., Ye, M., Garipov, T., Meyer, G. P., Vondrick, C., Chen, Z., Chai, Y., Wolff, E. M., & Srinivasa, S. (2025). Generative Data Mining with Longtail-Guided Diffusion. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:22298-22317. Available from https://proceedings.mlr.press/v267/hayden25a.html.
