MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

Felicia Bader, Philipp Seeböck, Anastasia Bartashova, Ulrike Attenberger, Georg Langs
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:4102-4116, 2026.

Abstract

In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks.
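The core idea of multi-instance alignment — matching each report sentence to its best-supporting image patch rather than to a single global image embedding — can be illustrated with a minimal sketch. This is a generic illustration under assumed shapes, not the paper's implementation; the function name and the max-over-patches aggregation are illustrative choices.

```python
import numpy as np

def multi_instance_alignment_score(patch_emb, sent_emb):
    """Score an image-report pair: each sentence is matched to its
    best-matching image patch (max over patches), and the per-sentence
    scores are averaged. Both embedding sets are L2-normalized so the
    dot product is a cosine similarity.

    patch_emb: (num_patches, dim) local image patch embeddings
    sent_emb:  (num_sentences, dim) report sentence embeddings
    """
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sim = s @ p.T                  # (num_sentences, num_patches) similarities
    return sim.max(axis=1).mean()  # best patch per sentence, averaged
```

A sentence describing a tiny finding thus only needs one well-aligned patch to score highly, which is the motivation for multi-instance formulations when findings occupy a small fraction of a large image.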

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-bader26a,
  title     = {MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images},
  author    = {Bader, Felicia and Seeb\"ock, Philipp and Bartashova, Anastasia and Attenberger, Ulrike and Langs, Georg},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {4102--4116},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/bader26a/bader26a.pdf},
  url       = {https://proceedings.mlr.press/v315/bader26a.html},
  abstract  = {In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks.}
}
Endnote
%0 Conference Paper
%T MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
%A Felicia Bader
%A Philipp Seeböck
%A Anastasia Bartashova
%A Ulrike Attenberger
%A Georg Langs
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-bader26a
%I PMLR
%P 4102--4116
%U https://proceedings.mlr.press/v315/bader26a.html
%V 315
%X In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose "MApLe", a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks.
APA
Bader, F., Seeböck, P., Bartashova, A., Attenberger, U. & Langs, G. (2026). MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:4102-4116. Available from https://proceedings.mlr.press/v315/bader26a.html.
