A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding

Christina Liu, Alan Q. Wang, Joy Hsu, Jiajun Wu, Ehsan Adeli
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2958-2986, 2026.

Abstract

Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models), which are composed to make a prediction. The dominant approach to composing tools is text-based, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding that composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, the framework leverages an off-the-shelf medical VLM to select tools from a toolbox, each of which extracts clinically relevant features. Instead of composing these tools via text, the TBM computes and fuses the tool outputs with a neural network before outputting the final prediction. We propose a simple and effective strategy that allows TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool use in medical imaging contexts, but also yields more interpretable, clinically grounded predictors. We evaluate on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes.
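
The abstract describes the TBM only at a high level. The following minimal PyTorch sketch shows one way such a model could fuse tool outputs and handle an arbitrary VLM tool selection via a binary mask; all class names, dimensions, and the masking scheme are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class ToolBottleneckModel(nn.Module):
    """Hypothetical TBM: fuses (masked) tool outputs with a small neural network."""

    def __init__(self, num_tools: int, tool_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        # Fusion head over the concatenated, masked tool features.
        self.fuse = nn.Sequential(
            nn.Linear(num_tools * tool_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, tool_feats: torch.Tensor, tool_mask: torch.Tensor) -> torch.Tensor:
        # tool_feats: (B, num_tools, tool_dim), outputs of every tool in the toolbox
        # tool_mask:  (B, num_tools), binary; 1 where the VLM selected the tool
        masked = tool_feats * tool_mask.unsqueeze(-1)  # zero out unselected tools
        return self.fuse(masked.flatten(1))            # class logits

# Toy usage: 8 tools with 32-dim outputs, 3 classes; the VLM picks a subset per image.
model = ToolBottleneckModel(num_tools=8, tool_dim=32, num_classes=3)
feats = torch.randn(4, 8, 32)            # stand-in for tool outputs
mask = (torch.rand(4, 8) > 0.5).float()  # stand-in for VLM tool selection
logits = model(feats, mask)              # shape (4, 3)

Under this reading, masking is what lets a single trained network serve any tool subset, which is one plausible interpretation of the paper's claim that TBMs can "make predictions with any arbitrary VLM tool selection."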

Cite this Paper

BibTeX
@InProceedings{pmlr-v315-liu26d,
  title     = {A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding},
  author    = {Liu, Christina and Wang, Alan Q. and Hsu, Joy and Wu, Jiajun and Adeli, Ehsan},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {2958--2986},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/liu26d/liu26d.pdf},
  url       = {https://proceedings.mlr.press/v315/liu26d.html}
}
Endnote
%0 Conference Paper
%T A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding
%A Christina Liu
%A Alan Q. Wang
%A Joy Hsu
%A Jiajun Wu
%A Ehsan Adeli
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-liu26d
%I PMLR
%P 2958--2986
%U https://proceedings.mlr.press/v315/liu26d.html
%V 315
APA
Liu, C., Wang, A. Q., Hsu, J., Wu, J. & Adeli, E. (2026). A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2958-2986. Available from https://proceedings.mlr.press/v315/liu26d.html.
