Decoupling Vision and Reasoning: A Data-Efficient Pipeline for Surgical VQA
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2038-2056, 2026.
Abstract
Vision-language models (VLMs) are becoming increasingly important for surgical intelligence, where reliable scene understanding requires combining visual perception with language-based reasoning. However, progress is constrained by the scarcity of high-quality multimodal datasets, making end-to-end training prone to overfitting. Existing approaches often address this limitation by converting task-specific datasets (e.g., segmentation, phase recognition, tool-tissue interaction) into synthetic visual question answering (VQA) form, but such conversions provide only sparse supervision and limit generalization. To overcome these challenges, we propose Surg-SAGE (Structured Abstraction from Granular Experts), a modular pipeline that decouples visual information extraction from reasoning. Specialist surgical models, proven effective for their respective vision tasks, are first used to extract task-relevant signals, which are then transformed via heuristics into structured textual descriptions. These descriptions, together with the clinical question, are passed to a large language model (LLM) that performs the reasoning step and produces the answer. The novelty of this work lies in demonstrating that decoupling perception from language processing and leveraging expert-trained specialist models enables strong VQA performance, even when paired with relatively lightweight, frozen LLMs and without requiring multimodal training data. We evaluate this pipeline on the EndoVis-18-VQA benchmark under different configurations of specialist models and LLMs, showing that combining complementary experts yields stronger performance than relying on any single model. Surg-SAGE achieves higher accuracy, recall, and F1 than existing surgical VQA baselines, with improvements of up to 2.3% in accuracy without requiring multimodal training, establishing abstraction-driven modularity as a data-efficient and generalizable paradigm for surgical vision-language understanding.
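To make the decoupling concrete, below is a minimal Python sketch of the pipeline shape the abstract describes: specialist-model outputs are abstracted into structured text, combined with the clinical question, and handed to a frozen LLM for reasoning. All names (`SceneSignals`, `signals_to_text`, `build_prompt`) and the dummy expert outputs are illustrative assumptions, not the authors' implementation; the actual specialist models, heuristics, and LLM call are elided.

```python
# Hypothetical sketch of the Surg-SAGE-style decoupled pipeline.
# Names and dummy values are illustrative stand-ins, not the paper's code.
from dataclasses import dataclass


@dataclass
class SceneSignals:
    """Task-relevant signals extracted by specialist vision models."""
    tools: list[str]          # e.g. from a tool-detection expert
    phase: str                # e.g. from a phase-recognition expert
    interactions: list[str]   # e.g. from a tool-tissue interaction expert


def signals_to_text(signals: SceneSignals) -> str:
    """Heuristic abstraction step: render structured expert signals
    into a textual scene description a frozen LLM can reason over."""
    lines = [f"Surgical phase: {signals.phase}."]
    if signals.tools:
        lines.append("Visible instruments: " + ", ".join(signals.tools) + ".")
    for interaction in signals.interactions:
        lines.append(f"Interaction: {interaction}.")
    return " ".join(lines)


def build_prompt(description: str, question: str) -> str:
    """Combine the scene description with the clinical question; the
    resulting prompt would be sent to a frozen, lightweight LLM."""
    return (
        "You are assisting with surgical scene understanding.\n"
        f"Scene description: {description}\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Dummy outputs standing in for real specialist-model predictions.
    signals = SceneSignals(
        tools=["bipolar forceps", "monopolar curved scissors"],
        phase="tissue dissection",
        interactions=["monopolar curved scissors cutting tissue"],
    )
    prompt = build_prompt(signals_to_text(signals),
                          "What is the scissors doing?")
    print(prompt)  # passed to the frozen LLM for the reasoning step
```

Because the LLM only ever sees text, swapping in a different specialist model or a different language model requires no multimodal retraining, which is the data-efficiency argument the abstract makes.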