Visual Prompt Engineering for Vision Language Models in Radiology

Stefan Denner, Markus Ralf Bujotzek, Dimitrios Bounias, David Zimmerer, Raphael Stock, Klaus Maier-Hein
Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:310-326, 2026.

Abstract

Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and provide our codebase and preprocessing pipeline, serving as a reference point for future work on localized classification in medical imaging.
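
The marker-embedding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `draw_bounding_box`, the nested-list grayscale image, and the `(top, left, bottom, right)` box convention are all assumptions made for illustration. It only shows the general idea of drawing a marker into the pixel data before an image is passed to a zero-shot classifier.

```python
from typing import List, Tuple

# Hypothetical representation: a grayscale image as rows of 0-255 pixel values.
Image = List[List[int]]


def draw_bounding_box(image: Image, box: Tuple[int, int, int, int],
                      value: int = 255, thickness: int = 1) -> Image:
    """Return a copy of `image` with a rectangular outline drawn on it.

    `box` is (top, left, bottom, right) in inclusive pixel coordinates.
    """
    height, width = len(image), len(image[0])
    top, left, bottom, right = box
    if not (0 <= top <= bottom < height and 0 <= left <= right < width):
        raise ValueError("box must lie inside the image")

    # Copy each row so the original image is left untouched.
    marked = [row[:] for row in image]
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            # A pixel belongs to the outline if it is within `thickness`
            # pixels of any of the four box edges.
            on_border = (r - top < thickness or bottom - r < thickness
                         or c - left < thickness or right - c < thickness)
            if on_border:
                marked[r][c] = value
    return marked


# Toy 5x6 "image" with a box around a hypothetical region of interest.
xray = [[0] * 6 for _ in range(5)]
marked = draw_bounding_box(xray, box=(1, 1, 3, 4))
for row in marked:
    print("".join("#" if px else "." for px in row))
```

The marked copy, rather than the original image, would then be encoded by the vision-language model. Arrows or circles would follow the same pattern with a different outline test.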

Cite this Paper


BibTeX
@InProceedings{pmlr-v301-denner26a,
  title = {Visual Prompt Engineering for Vision Language Models in Radiology},
  author = {Denner, Stefan and Bujotzek, Markus Ralf and Bounias, Dimitrios and Zimmerer, David and Stock, Raphael and Maier-Hein, Klaus},
  booktitle = {Proceedings of The 8th International Conference on Medical Imaging with Deep Learning},
  pages = {310--326},
  year = {2026},
  editor = {Tasdizen, Tolga and Elhabian, Shireen and Summers, Ronald and Chen, Chen and Koch, Lisa and Zhuang, Yan},
  volume = {301},
  series = {Proceedings of Machine Learning Research},
  month = {09--11 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v301/main/assets/denner26a/denner26a.pdf},
  url = {https://proceedings.mlr.press/v301/denner26a.html},
  abstract = {Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and provide our codebase and preprocessing pipeline, serving as a reference point for future work on localized classification in medical imaging.}
}
Endnote
%0 Conference Paper
%T Visual Prompt Engineering for Vision Language Models in Radiology
%A Stefan Denner
%A Markus Ralf Bujotzek
%A Dimitrios Bounias
%A David Zimmerer
%A Raphael Stock
%A Klaus Maier-Hein
%B Proceedings of The 8th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Tolga Tasdizen
%E Shireen Elhabian
%E Ronald Summers
%E Chen Chen
%E Lisa Koch
%E Yan Zhuang
%F pmlr-v301-denner26a
%I PMLR
%P 310--326
%U https://proceedings.mlr.press/v301/denner26a.html
%V 301
%X Medical image classification plays a crucial role in clinical decision-making, yet most models are constrained to a fixed set of predefined classes, limiting their adaptability to new conditions. Contrastive Language-Image Pretraining (CLIP) offers a promising solution by enabling zero-shot classification through multimodal large-scale pretraining. However, while CLIP effectively captures global image content, radiology requires a more localized focus on specific pathology regions to enhance both interpretability and diagnostic accuracy. To address this, we explore the potential of incorporating visual cues into zero-shot classification, embedding visual markers, such as arrows, bounding boxes, and circles, directly into radiological images to guide model attention. Evaluating across four public chest X-ray datasets, we demonstrate that visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. Furthermore, attention map analysis confirms that visual cues help models focus on clinically relevant areas, leading to more interpretable predictions. To support further research, we use public datasets and provide our codebase and preprocessing pipeline, serving as a reference point for future work on localized classification in medical imaging.
APA
Denner, S., Bujotzek, M.R., Bounias, D., Zimmerer, D., Stock, R. & Maier-Hein, K. (2026). Visual Prompt Engineering for Vision Language Models in Radiology. Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 301:310-326. Available from https://proceedings.mlr.press/v301/denner26a.html.