Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification

Connor Pink; Ed Sykes

Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:954-961, 2026.

Abstract

Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at a low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.
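
The count-matched comparison described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `build_training_set`, the `(source, id)` representation, and the reading of r as "extra samples = round(r × number of real samples)" are all assumptions made for illustration. The point it shows is that the synthetic arm and the duplicate-real control receive the same number of added samples, so any difference between them cannot be attributed to dataset size alone.

```python
import random


def build_training_set(real_ids, synthetic_ids, r, mode, seed=0):
    """Return a list of (source, id) pairs for one augmentation arm.

    Hypothetical helper, not taken from the paper.
    mode="synthetic": add round(r * n_real) distinct synthetic samples.
    mode="duplicate": add the same number of repeated real samples.
    """
    rng = random.Random(seed)
    n_extra = round(r * len(real_ids))  # assumed meaning of the ratio r
    base = [("real", i) for i in real_ids]

    if mode == "synthetic":
        if n_extra > len(synthetic_ids):
            raise ValueError("not enough synthetic samples for this ratio")
        extra = [("synthetic", i) for i in rng.sample(synthetic_ids, n_extra)]
    elif mode == "duplicate":
        # Sample with replacement so that ratios above 1 remain possible.
        extra = [("real", i) for i in rng.choices(real_ids, k=n_extra)]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    return base + extra


if __name__ == "__main__":
    real = list(range(100))        # placeholder ids for real images
    synthetic = list(range(1000))  # placeholder ids for generated images
    for r in (0.25, 1, 5):
        dup = build_training_set(real, synthetic, r, "duplicate")
        syn = build_training_set(real, synthetic, r, "synthetic")
        print(r, len(dup), len(syn))
```

Under these assumptions both arms have identical size at every ratio, and only the origin of the added samples differs.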

Cite this Paper

BibTeX

@InProceedings{pmlr-v318-pink26a,
  title = 	 {Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification},
  author =       {Pink, Connor and Sykes, Ed},
  booktitle = 	 {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = 	 {954--961},
  year = 	 {2026},
  editor = 	 {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = 	 {318},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--29 May},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v318/main/assets/pink26a/pink26a.pdf},
  url = 	 {https://proceedings.mlr.press/v318/pink26a.html},
  abstract = 	 {Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at a low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.}
}

Endnote

%0 Conference Paper
%T Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification
%A Connor Pink
%A Ed Sykes
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-pink26a
%I PMLR
%P 954--961
%U https://proceedings.mlr.press/v318/pink26a.html
%V 318
%X Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at a low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.

APA

Pink, C. & Sykes, E. (2026). Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:954-961. Available from https://proceedings.mlr.press/v318/pink26a.html.
