[edit]
Detecting Any instruction-to-answer interaction relationship:Universal Instruction-to-Answer Navigator for Med-VQA
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:53909-53927, 2024.
Abstract
Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions for precise diagnostics, yet faces challenges due to diverse, inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, facilitating the understanding of visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and mark the core part of instructions with "real intent" labels. The IAI-Med VQA dataset, produced using IAI, is now publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring answers are traceable and learnable. We also implement intention-guided attention to minimize non-core instruction interference, sharpening focus on ’real intent’. Extensive experiments on SLAKE datasets show Uni-Med’s superior accuracies (87.52% closed, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%. Code and dataset are available at: https://github.com/zhongzee/Uni-Med-master.