Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification

Connor Pink, Ed Sykes
Proceedings of the 39th Canadian Conference on Artificial Intelligence, PMLR 318:954-961, 2026.

Abstract

Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.
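
The count-matched comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the ratio r means "added samples per real sample" over the whole training set, and that the duplicate-real control draws its extra samples from the real set with replacement; the abstract specifies neither detail. The function name `build_arm` and the arm labels are hypothetical.

```python
import random


def build_arm(real_ids, synthetic_ids, r, arm, seed=0):
    """Return a training list for one experimental arm at ratio r.

    Assumption (not stated in the abstract): r = added samples / real samples.
    Both augmented arms add the same number of samples, so any difference
    between them cannot be attributed to dataset size alone.
    """
    rng = random.Random(seed)
    n_extra = round(r * len(real_ids))

    if arm == "real_only":
        extra = []
    elif arm == "duplicate_real":
        # Control arm: repeat real samples (with replacement, an assumption).
        extra = [rng.choice(real_ids) for _ in range(n_extra)]
    elif arm == "synthetic":
        if n_extra > len(synthetic_ids):
            raise ValueError("not enough synthetic samples for this ratio")
        # Treatment arm: add distinct generated samples.
        extra = rng.sample(synthetic_ids, n_extra)
    else:
        raise ValueError(f"unknown arm: {arm}")

    return list(real_ids) + extra


if __name__ == "__main__":
    real = [f"real_{i}" for i in range(8)]
    synth = [f"syn_{i}" for i in range(40)]
    for r in (0.25, 5):
        dup = build_arm(real, synth, r, "duplicate_real")
        syn = build_arm(real, synth, r, "synthetic")
        print(r, len(dup), len(syn))
```

Because both augmented arms have identical length at every r, a gain of the synthetic arm over the real-only baseline that the duplicate-real arm also achieves would point to dataset size, not synthetic diversity, as the cause.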

Cite this Paper


BibTeX
@InProceedings{pmlr-v318-pink26a,
  title = {Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification},
  author = {Pink, Connor and Sykes, Ed},
  booktitle = {Proceedings of the 39th Canadian Conference on Artificial Intelligence},
  pages = {954--961},
  year = {2026},
  editor = {Bouzar-Benlabiod, Lydia and Leung, Carson},
  volume = {318},
  series = {Proceedings of Machine Learning Research},
  month = {25--29 May},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v318/main/assets/pink26a/pink26a.pdf},
  url = {https://proceedings.mlr.press/v318/pink26a.html},
  abstract = {Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.}
}
Endnote
%0 Conference Paper
%T Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification
%A Connor Pink
%A Ed Sykes
%B Proceedings of the 39th Canadian Conference on Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2026
%E Lydia Bouzar-Benlabiod
%E Carson Leung
%F pmlr-v318-pink26a
%I PMLR
%P 954--961
%U https://proceedings.mlr.press/v318/pink26a.html
%V 318
%X Synthetic data augmentation is often proposed as a remedy for limited and imbalanced medical imaging datasets. We study tuberculosis detection on the Tuberculosis Chest X-ray Database by training a 256$\times$256 WGAN-GP and a 512$\times$512 latent diffusion model fine-tuned from RoentGen-v2. We evaluate both for image quality and downstream utility. On generative metrics, diffusion outperforms WGAN-GP, achieving lower FID (6.56 vs. 9.28) and substantially lower radiology-aligned Rad-Dino FID (117.89 vs. 201.78), along with higher SSIM/MS-SSIM under our deterministic gen–real pairing protocol. However, in controlled DenseNet-121 classifier experiments under a fixed optimization budget (4,000 steps with identical selection criteria), synthetic augmentation does not outperform a count-matched duplicate-real control at matched dataset size. The duplicate-real control yields the best downstream performance despite adding no new information (e.g., 0.9981 $\pm$ 0.0013 test AUPRC at r = 5), while the best synthetic setting is diffusion at low ratio (r = 0.25). Increasing the synthetic-to-real ratio is not beneficial: high synthetic proportions degrade downstream performance, with particularly sharp deterioration for WGAN-GP at large ratios. Overall, the study demonstrates that superiority on generative metrics does not guarantee downstream benefit and highlights the importance of rigorous, count-matched augmentation controls when claiming gains from synthetic data.
APA
Pink, C. & Sykes, E. (2026). Disentangling dataset size from synthetic diversity in tuberculosis chest X-ray classification. Proceedings of the 39th Canadian Conference on Artificial Intelligence, in Proceedings of Machine Learning Research 318:954-961. Available from https://proceedings.mlr.press/v318/pink26a.html.
