Semi-Synthetic Localization Datasets for Radiological Findings on Chest X-Rays
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2834-2863, 2026.
Abstract
While large datasets for chest X-ray (CXR) finding classification are widely available, datasets for finding localization are scarce. Curating localization datasets is costly and time-intensive because it requires manual annotation by medical experts, so the resulting datasets are often small and limited in scope. To overcome this, we introduce SemiSynCXR, a framework that automatically generates semi-synthetic localization datasets. SemiSynCXR inpaints specific radiological findings into real, healthy CXRs at anatomically plausible locations and outputs both the edited image and the ground-truth bounding box for each finding. SemiSynCXR-generated CXRs effectively augment existing localization datasets, yielding relative mAP$_{10:70}$ gains of up to 11% on in-domain and 21% on out-of-domain data, thereby mitigating data scarcity and improving generalization. Comprehensive quantitative and qualitative evaluations show that our framework achieves an overall AUROC of 0.78 and mAP$_{10:70}$ of 0.45, comparable to fully synthetic benchmarks. These results confirm that the generated findings are realistic and accurately localized, establishing SemiSynCXR as a practical solution for generating CXR finding localization datasets.
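
To illustrate why inpainting can yield localization labels without manual annotation, the following minimal sketch derives a bounding box from the region that was edited. The abstract does not state how SemiSynCXR computes its boxes; this sketch assumes the box is simply the tightest rectangle around a binary inpainting mask. The function name `mask_to_bbox`, the `(x_min, y_min, x_max, y_max)` convention, and the example mask are all hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in inclusive pixel indices.
BBox = Tuple[int, int, int, int]


def mask_to_bbox(mask: List[List[int]]) -> Optional[BBox]:
    """Return the tightest box around all nonzero pixels, or None if the mask is empty."""
    # Rows that contain at least one inpainted pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    # Column index of every inpainted pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


# Hypothetical 5x6 inpainting mask: 1 marks pixels where a finding was inserted.
example_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(mask_to_bbox(example_mask))  # (1, 1, 3, 3)
```

Because the edited region is known at generation time, a label of this kind comes at no extra annotation cost, which is the property the framework relies on.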