Semi-Synthetic Localization Datasets for Radiological Findings on Chest X-Rays

Andrea Posada, Johannes Brandt, Friederike Jungmann, Maria Posada, Daniel Rueckert, Martin J. Menten, Felix Meissen, Philip Müller
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2834-2863, 2026.

Abstract

While large datasets for chest X-ray (CXR) finding classification are widely available, datasets for finding localization are scarce. Curating these localization datasets is costly and time-intensive, requiring manual annotation by medical experts, which often results in them being small and limited in scope. To overcome this, we introduce SemiSynCXR, a framework designed to automatically generate semi-synthetic localization datasets. SemiSynCXR operates by inpainting specific radiological findings into real, healthy CXRs at anatomically plausible locations, which allows for the output of both the edited image and the ground-truth bounding box for each finding. SemiSynCXR-generated CXRs effectively augment existing localization datasets, yielding relative mAP$_{10:70}$ gains of up to 11% on in-domain and 21% on out-of-domain data, thereby mitigating data scarcity and improving generalization. Comprehensive quantitative and qualitative evaluations show that our framework achieves an overall AUROC of 0.78 and mAP$_{10:70}$ of 0.45, comparable to fully synthetic benchmarks. These results confirm that the generated findings are realistic and accurately localized, establishing SemiSynCXR as a practical solution for the generation of CXR finding localization datasets.

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-posada26a,
  title = {Semi-Synthetic Localization Datasets for Radiological Findings on Chest X-Rays},
  author = {Posada, Andrea and Brandt, Johannes and Jungmann, Friederike and Posada, Maria and Rueckert, Daniel and Menten, Martin J. and Meissen, Felix and M\"uller, Philip},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages = {2834--2863},
  year = {2026},
  editor = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume = {315},
  series = {Proceedings of Machine Learning Research},
  month = {08--10 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/posada26a/posada26a.pdf},
  url = {https://proceedings.mlr.press/v315/posada26a.html},
  abstract = {While large datasets for chest X-ray (CXR) finding classification are widely available, datasets for finding localization are scarce. Curating these localization datasets is costly and time-intensive, requiring manual annotation by medical experts, which often results in them being small and limited in scope. To overcome this, we introduce SemiSynCXR, a framework designed to automatically generate semi-synthetic localization datasets. SemiSynCXR operates by inpainting specific radiological findings into real, healthy CXRs at anatomically plausible locations, which allows for the output of both the edited image and the ground-truth bounding box for each finding. SemiSynCXR-generated CXRs effectively augment existing localization datasets, yielding relative mAP$_{10:70}$ gains of up to 11% on in-domain and 21% on out-of-domain data, thereby mitigating data scarcity and improving generalization. Comprehensive quantitative and qualitative evaluations show that our framework achieves an overall AUROC of 0.78 and mAP$_{10:70}$ of 0.45, comparable to fully synthetic benchmarks. These results confirm that the generated findings are realistic and accurately localized, establishing SemiSynCXR as a practical solution for the generation of CXR finding localization datasets.}
}
Endnote
%0 Conference Paper
%T Semi-Synthetic Localization Datasets for Radiological Findings on Chest X-Rays
%A Andrea Posada
%A Johannes Brandt
%A Friederike Jungmann
%A Maria Posada
%A Daniel Rueckert
%A Martin J. Menten
%A Felix Meissen
%A Philip Müller
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-posada26a
%I PMLR
%P 2834--2863
%U https://proceedings.mlr.press/v315/posada26a.html
%V 315
%X While large datasets for chest X-ray (CXR) finding classification are widely available, datasets for finding localization are scarce. Curating these localization datasets is costly and time-intensive, requiring manual annotation by medical experts, which often results in them being small and limited in scope. To overcome this, we introduce SemiSynCXR, a framework designed to automatically generate semi-synthetic localization datasets. SemiSynCXR operates by inpainting specific radiological findings into real, healthy CXRs at anatomically plausible locations, which allows for the output of both the edited image and the ground-truth bounding box for each finding. SemiSynCXR-generated CXRs effectively augment existing localization datasets, yielding relative mAP$_{10:70}$ gains of up to 11% on in-domain and 21% on out-of-domain data, thereby mitigating data scarcity and improving generalization. Comprehensive quantitative and qualitative evaluations show that our framework achieves an overall AUROC of 0.78 and mAP$_{10:70}$ of 0.45, comparable to fully synthetic benchmarks. These results confirm that the generated findings are realistic and accurately localized, establishing SemiSynCXR as a practical solution for the generation of CXR finding localization datasets.
APA
Posada, A., Brandt, J., Jungmann, F., Posada, M., Rueckert, D., Menten, M.J., Meissen, F. & Müller, P. (2026). Semi-Synthetic Localization Datasets for Radiological Findings on Chest X-Rays. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2834-2863. Available from https://proceedings.mlr.press/v315/posada26a.html.
