Decoupling Vision and Reasoning: A Data-Efficient Pipeline for Surgical VQA
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2038-2056, 2026.
Abstract
Vision-language models (VLMs) are becoming increasingly important for surgical intelligence, where reliable scene understanding requires combining visual perception with language-based reasoning. However, progress is constrained by the scarcity of high-quality multimodal datasets, making end-to-end training prone to overfitting. Existing approaches often address this limitation by converting task-specific datasets (e.g., segmentation, phase recognition, tool-tissue interaction) into synthetic visual question answering (VQA) form, but such conversions provide only sparse supervision and limit generalization. To overcome these challenges, we propose Surg-SAGE (Structured Abstraction from Granular Experts), a modular pipeline that decouples visual information extraction from reasoning. Specialist surgical models, proven effective for their respective vision tasks, are first used to extract task-relevant signals, which are then transformed via heuristics into structured textual descriptions. These descriptions, together with the clinical question, are passed to a large language model (LLM) that performs the reasoning step and produces the answer. The novelty of this work lies in demonstrating that decoupling perception from language processing and leveraging expert-trained specialist models enables strong VQA performance, even when paired with relatively lightweight, frozen LLMs and without requiring multimodal training data. We evaluate this pipeline on the EndoVis-18-VQA benchmark under different configurations of specialist models and LLMs, showing that combining complementary experts yields stronger performance than relying on any single model. Surg-SAGE achieves higher accuracy, recall, and F1 than existing surgical VQA baselines, with improvements of up to 2.3% in accuracy without requiring multimodal training, establishing abstraction-driven modularity as a data-efficient and generalizable paradigm for surgical vision-language understanding.
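To make the decoupling concrete, below is a minimal Python sketch of the pipeline shape the abstract describes: specialist-model outputs are abstracted into structured text, combined with the clinical question, and handed to a frozen LLM for reasoning. All names (`SceneSignals`, `signals_to_text`, `build_prompt`) and the dummy expert outputs are illustrative assumptions, not the authors' implementation; the actual specialist models, heuristics, and LLM call are elided.

```python
# Hypothetical sketch of the Surg-SAGE-style decoupled pipeline.
# Names and dummy values are illustrative stand-ins, not the paper's code.
from dataclasses import dataclass


@dataclass
class SceneSignals:
    """Task-relevant signals extracted by specialist vision models."""
    tools: list[str]          # e.g. from a tool-detection expert
    phase: str                # e.g. from a phase-recognition expert
    interactions: list[str]   # e.g. from a tool-tissue interaction expert


def signals_to_text(signals: SceneSignals) -> str:
    """Heuristic abstraction step: render structured expert signals
    into a textual scene description a frozen LLM can reason over."""
    lines = [f"Surgical phase: {signals.phase}."]
    if signals.tools:
        lines.append("Visible instruments: " + ", ".join(signals.tools) + ".")
    for interaction in signals.interactions:
        lines.append(f"Interaction: {interaction}.")
    return " ".join(lines)


def build_prompt(description: str, question: str) -> str:
    """Combine the scene description with the clinical question; the
    resulting prompt would be sent to a frozen, lightweight LLM."""
    return (
        "You are assisting with surgical scene understanding.\n"
        f"Scene description: {description}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Dummy outputs standing in for real specialist-model predictions.
    signals = SceneSignals(
        tools=["bipolar forceps", "monopolar curved scissors"],
        phase="tissue dissection",
        interactions=["monopolar curved scissors cutting tissue"],
    )
    prompt = build_prompt(signals_to_text(signals),
                          "What is the scissors doing?")
    print(prompt)  # passed to the frozen LLM for the reasoning step
```

Because the LLM only ever sees text, swapping in a different specialist model or a different language model requires no multimodal retraining, which is the data-efficiency argument the abstract makes.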