Informative Synthetic Data Generation for Thorax Disease Classification

Yancheng Wang; Rajeev Goel; Marko Jojic; Alvin C. Silva; Teresa Wu; Yingzhen Yang

Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:4489-4514, 2025.

Abstract

Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets, which are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose Informative Data Selection (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at https://github.com/Statistical-Deep-Learning/IDS.

Cite this Paper

BibTeX

@InProceedings{pmlr-v286-wang25g,
  title = 	 {Informative Synthetic Data Generation for Thorax Disease Classification},
  author =       {Wang, Yancheng and Goel, Rajeev and Jojic, Marko and Silva, Alvin C. and Wu, Teresa and Yang, Yingzhen},
  booktitle = 	 {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages = 	 {4489--4514},
  year = 	 {2025},
  editor = 	 {Chiappa, Silvia and Magliacane, Sara},
  volume = 	 {286},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--25 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v286/main/assets/wang25g/wang25g.pdf},
  url = 	 {https://proceedings.mlr.press/v286/wang25g.html},
  abstract = 	 {Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets, which are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.}
}

Endnote

%0 Conference Paper
%T Informative Synthetic Data Generation for Thorax Disease Classification
%A Yancheng Wang
%A Rajeev Goel
%A Marko Jojic
%A Alvin C. Silva
%A Teresa Wu
%A Yingzhen Yang
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-wang25g
%I PMLR
%P 4489--4514
%U https://proceedings.mlr.press/v286/wang25g.html
%V 286
%X Deep Neural Networks (DNNs), including architectures such as Vision Transformers (ViTs), have achieved remarkable success in medical imaging tasks. However, their performance typically hinges on the availability of large-scale, high-quality labeled datasets, which are often scarce or infeasible to obtain in medical domains. Generative Data Augmentation (GDA) offers a promising remedy by supplementing training sets with synthetic data generated via generative models like Diffusion Models (DMs). Yet, this approach introduces a critical challenge: synthetic data often contains significant noise, which can degrade the performance of classifiers trained on such augmented datasets. Prior solutions, including data selection and re-weighting techniques, often rely on access to clean metadata or pretrained external classifiers. In this work, we propose \emph{Informative Data Selection} (IDS), a principled sample re-weighting framework grounded in the Information Bottleneck (IB) principle. IDS assigns higher weights to more informative synthetic samples, thereby improving classifier performance in GDA-enhanced training for thorax disease classification. Extensive experiments demonstrate that IDS significantly outperforms existing data selection and re-weighting baselines. Our code is publicly available at \url{https://github.com/Statistical-Deep-Learning/IDS}.

APA

Wang, Y., Goel, R., Jojic, M., Silva, A.C., Wu, T. & Yang, Y. (2025). Informative Synthetic Data Generation for Thorax Disease Classification. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:4489-4514. Available from https://proceedings.mlr.press/v286/wang25g.html.
