Detecting Any instruction-to-answer interaction relationship: Universal Instruction-to-Answer Navigator for Med-VQA

Zhongze Wu, Hongyan Xu, Yitian Long, Shan You, Xiu Su, Jun Long, Yueyi Luo, Chang Xu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:53909-53927, 2024.

Abstract

Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions for precise diagnostics, yet faces challenges due to diverse, inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, facilitating the understanding of the visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and mark the core part of instructions with "real intent" labels. The IAI-Med VQA dataset, produced using IAI, is now publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring answers are traceable and learnable. We also implement intention-guided attention to minimize interference from non-core instruction tokens, sharpening the focus on "real intent". Extensive experiments on the SLAKE dataset show Uni-Med's superior accuracy (87.52% closed-ended, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%, respectively. Code and dataset are available at: https://github.com/zhongzee/Uni-Med-master.
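For readers who want a more concrete picture of the two mechanisms named in the abstract, the sketch below shows one plausible way a token-level cut-mix and an intention-guided attention could look in PyTorch. This is a minimal illustration under assumed tensor shapes; the function names (token_level_cutmix, intention_guided_attention) and parameters (mix_ratio, intent_mask, boost) are hypothetical and are not taken from the authors' released implementation, which is available in the linked repository.

import math
import torch
import torch.nn.functional as F

def token_level_cutmix(image_tokens, explanation_tokens, mix_ratio=0.3):
    # image_tokens, explanation_tokens: (batch, num_patches, dim)
    # Replace a random subset of image patch tokens with the corresponding
    # visual-explanation tokens so that the answer evidence stays traceable.
    batch, num_patches, _ = image_tokens.shape
    num_mix = max(1, int(mix_ratio * num_patches))
    idx = torch.randperm(num_patches)[:num_mix]
    mixed = image_tokens.clone()
    mixed[:, idx, :] = explanation_tokens[:, idx, :]
    return mixed

def intention_guided_attention(q, k, v, intent_mask, boost=2.0):
    # q: (batch, num_q, dim); k, v: (batch, num_k, dim)
    # intent_mask: (batch, num_k), 1 for core ("real intent") instruction
    # tokens, 0 otherwise. Core tokens receive a log-space boost before the
    # softmax, so they dominate attention without silencing other tokens.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + math.log(boost) * intent_mask.unsqueeze(1)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

In this sketch, the boost factor controls how strongly attention is steered toward the core-instruction tokens; the paper's actual weighting scheme may differ.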

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-wu24ac,
  title     = {Detecting Any instruction-to-answer interaction relationship: {U}niversal Instruction-to-Answer Navigator for Med-{VQA}},
  author    = {Wu, Zhongze and Xu, Hongyan and Long, Yitian and You, Shan and Su, Xiu and Long, Jun and Luo, Yueyi and Xu, Chang},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {53909--53927},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/wu24ac/wu24ac.pdf},
  url       = {https://proceedings.mlr.press/v235/wu24ac.html},
  abstract  = {Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions for precise diagnostics, yet faces challenges due to diverse, inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, facilitating the understanding of visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and mark the core part of instructions with "real intent" labels. The IAI-Med VQA dataset, produced using IAI, is now publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring answers are traceable and learnable. We also implement intention-guided attention to minimize non-core instruction interference, sharpening focus on 'real intent'. Extensive experiments on SLAKE datasets show Uni-Med's superior accuracies (87.52% closed, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%. Code and dataset are available at: https://github.com/zhongzee/Uni-Med-master.}
}
Endnote
%0 Conference Paper
%T Detecting Any instruction-to-answer interaction relationship: Universal Instruction-to-Answer Navigator for Med-VQA
%A Zhongze Wu
%A Hongyan Xu
%A Yitian Long
%A Shan You
%A Xiu Su
%A Jun Long
%A Yueyi Luo
%A Chang Xu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-wu24ac
%I PMLR
%P 53909--53927
%U https://proceedings.mlr.press/v235/wu24ac.html
%V 235
%X Medical Visual Question Answering (Med-VQA) interprets complex medical imagery using user instructions for precise diagnostics, yet faces challenges due to diverse, inadequately annotated images. In this paper, we introduce the Universal Instruction-Vision Navigator (Uni-Med) framework for extracting instruction-to-answer relationships, facilitating the understanding of visual evidence behind responses. Specifically, we design the Instruct-to-Answer Clues Interpreter (IAI) to generate visual explanations based on the answers and mark the core part of instructions with "real intent" labels. The IAI-Med VQA dataset, produced using IAI, is now publicly available to advance Med-VQA research. Additionally, our Token-Level Cut-Mix module dynamically aligns visual explanations with image patches, ensuring answers are traceable and learnable. We also implement intention-guided attention to minimize non-core instruction interference, sharpening focus on 'real intent'. Extensive experiments on SLAKE datasets show Uni-Med's superior accuracies (87.52% closed, 86.12% overall), outperforming MedVInT-PMC-VQA by 1.22% and 0.92%. Code and dataset are available at: https://github.com/zhongzee/Uni-Med-master.
APA
Wu, Z., Xu, H., Long, Y., You, S., Su, X., Long, J., Luo, Y. & Xu, C. (2024). Detecting Any instruction-to-answer interaction relationship: Universal Instruction-to-Answer Navigator for Med-VQA. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:53909-53927. Available from https://proceedings.mlr.press/v235/wu24ac.html.
