Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs

Advait Gosai, Arun Kavishwar, Stephanie L. McNamara, Soujanya Samineni, Renato Umeton, Alexander Chowdhury, William Lotter
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:572-587, 2026.

Abstract

Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few-shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.

Cite this Paper


BibTeX
@InProceedings{pmlr-v297-gosai26a,
  title     = {Beyond Diagnosis: Evaluating Multimodal {LLM}s for Pathology Localization in Chest Radiographs},
  author    = {Gosai, Advait and Kavishwar, Arun and McNamara, Stephanie L. and Samineni, Soujanya and Umeton, Renato and Chowdhury, Alexander and Lotter, William},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {572--587},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/gosai26a/gosai26a.pdf},
  url       = {https://proceedings.mlr.press/v297/gosai26a.html},
  abstract  = {Recent work has shown promising performance of frontier large language models ({LLM}s) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose {MLLM}s (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific {CNN} baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few-shot prompting. Our findings highlight both the promise and limitations of current {MLLM}s in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.}
}
Endnote
%0 Conference Paper
%T Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs
%A Advait Gosai
%A Arun Kavishwar
%A Stephanie L. McNamara
%A Soujanya Samineni
%A Renato Umeton
%A Alexander Chowdhury
%A William Lotter
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-gosai26a
%I PMLR
%P 572--587
%U https://proceedings.mlr.press/v297/gosai26a.html
%V 297
%X Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model's spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5's predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, but showed improvements when provided examples through few-shot prompting. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
APA
Gosai, A., Kavishwar, A., McNamara, S.L., Samineni, S., Umeton, R., Chowdhury, A. & Lotter, W. (2026). Beyond Diagnosis: Evaluating Multimodal LLMs for Pathology Localization in Chest Radiographs. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:572-587. Available from https://proceedings.mlr.press/v297/gosai26a.html.