Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks

Maynara Donato De Souza, Cleber Zanchettin
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:12762-12815, 2025.

Abstract

Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples’ representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-de-souza25a,
  title     = {Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks},
  author    = {De Souza, Maynara Donato and Zanchettin, Cleber},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {12762--12815},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/de-souza25a/de-souza25a.pdf},
  url       = {https://proceedings.mlr.press/v267/de-souza25a.html},
  abstract  = {Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples’ representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.}
}
Endnote
%0 Conference Paper
%T Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks
%A Maynara Donato De Souza
%A Cleber Zanchettin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-de-souza25a
%I PMLR
%P 12762--12815
%U https://proceedings.mlr.press/v267/de-souza25a.html
%V 267
%X Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these samples’ representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data.
APA
De Souza, M. D. & Zanchettin, C. (2025). Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:12762-12815. Available from https://proceedings.mlr.press/v267/de-souza25a.html.
