Informative Synthetic Data Generation for Thorax Disease Classification

Yancheng Wang, Rajeev Goel, Marko Jojic, Alvin C. Silva, Teresa Wu, Yingzhen Yang
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:4489-4514, 2025.

Abstract

Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets-resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-wang25g, title = {Informative Synthetic Data Generation for Thorax Disease Classification}, author = {Wang, Yancheng and Goel, Rajeev and Jojic, Marko and Silva, Alvin C. and Wu, Teresa and Yang, Yingzhen}, booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence}, pages = {4489--4514}, year = {2025}, editor = {Chiappa, Silvia and Magliacane, Sara}, volume = {286}, series = {Proceedings of Machine Learning Research}, month = {21--25 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/wang25g/wang25g.pdf}, url = {https://proceedings.mlr.press/v286/wang25g.html}, abstract = {Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets-resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.} }
Endnote
%0 Conference Paper %T Informative Synthetic Data Generation for Thorax Disease Classification %A Yancheng Wang %A Rajeev Goel %A Marko Jojic %A Alvin C. Silva %A Teresa Wu %A Yingzhen Yang %B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence %C Proceedings of Machine Learning Research %D 2025 %E Silvia Chiappa %E Sara Magliacane %F pmlr-v286-wang25g %I PMLR %P 4489--4514 %U https://proceedings.mlr.press/v286/wang25g.html %V 286 %X Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets-resources that are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.
APA
Wang, Y., Goel, R., Jojic, M., Silva, A.C., Wu, T. & Yang, Y.. (2025). Informative Synthetic Data Generation for Thorax Disease Classification. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:4489-4514 Available from https://proceedings.mlr.press/v286/wang25g.html.

Related Material