Provably Improving Generalization of Few-shot models with Synthetic Data

Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:46178-46195, 2025.

Abstract

Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often suffer performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-nguyen25h,
  title     = {Provably Improving Generalization of Few-shot models with Synthetic Data},
  author    = {Nguyen, Lan-Cuong and Nguyen-Tri, Quan and Khanh, Bang Tran and Le, Dung D. and Tran-Thanh, Long and Than, Khoat},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {46178--46195},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/nguyen25h/nguyen25h.pdf},
  url       = {https://proceedings.mlr.press/v267/nguyen25h.html},
  abstract  = {Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often suffer performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.}
}
Endnote
%0 Conference Paper
%T Provably Improving Generalization of Few-shot models with Synthetic Data
%A Lan-Cuong Nguyen
%A Quan Nguyen-Tri
%A Bang Tran Khanh
%A Dung D. Le
%A Long Tran-Thanh
%A Khoat Than
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-nguyen25h
%I PMLR
%P 46178--46195
%U https://proceedings.mlr.press/v267/nguyen25h.html
%V 267
%X Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often suffer performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theoretically grounded algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
APA
Nguyen, L., Nguyen-Tri, Q., Khanh, B.T., Le, D.D., Tran-Thanh, L. & Than, K. (2025). Provably Improving Generalization of Few-shot models with Synthetic Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:46178-46195. Available from https://proceedings.mlr.press/v267/nguyen25h.html.

Related Material