Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering

Luis-Jesus Marhuenda, Miquel Obrador-Reina, Mohamed Aas-Alas, Alberto Albiol, Roberto Paredes
Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, PMLR 301:1095-1106, 2026.

Abstract

Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a light-weight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.

Cite this Paper


BibTeX
@InProceedings{pmlr-v301-marhuenda26a,
  title = {Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering},
  author = {Marhuenda, Luis-Jesus and Obrador-Reina, Miquel and Aas-Alas, Mohamed and Albiol, Alberto and Paredes, Roberto},
  booktitle = {Proceedings of The 8th International Conference on Medical Imaging with Deep Learning},
  pages = {1095--1106},
  year = {2026},
  editor = {Tasdizen, Tolga and Elhabian, Shireen and Summers, Ronald and Chen, Chen and Koch, Lisa and Zhuang, Yan},
  volume = {301},
  series = {Proceedings of Machine Learning Research},
  month = {09--11 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v301/main/assets/marhuenda26a/marhuenda26a.pdf},
  url = {https://proceedings.mlr.press/v301/marhuenda26a.html},
  abstract = {Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a light-weight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.}
}
Endnote
%0 Conference Paper
%T Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering
%A Luis-Jesus Marhuenda
%A Miquel Obrador-Reina
%A Mohamed Aas-Alas
%A Alberto Albiol
%A Roberto Paredes
%B Proceedings of The 8th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Tolga Tasdizen
%E Shireen Elhabian
%E Ronald Summers
%E Chen Chen
%E Lisa Koch
%E Yan Zhuang
%F pmlr-v301-marhuenda26a
%I PMLR
%P 1095--1106
%U https://proceedings.mlr.press/v301/marhuenda26a.html
%V 301
%X Difference Medical Visual Question Answering (Diff-VQA), a specialized subfield of Medical VQA, tackles the critical task of identifying and describing differences between pairs of medical images. This study introduces a novel Vision Encoder-Decoder (VED) architecture tailored for this task, focusing on the comparison of chest X-ray images to detect and explain changes. The proposed model incorporates two key innovations: (1) a light-weight Transformer text decoder architecture capable of generating precise and contextually relevant answers to complex medical questions, and (2) an enhanced fusion mechanism that improves the model’s ability to distinguish between two input images, enabling more accurate comparison of radiological findings. Our approach excels in identifying significant changes, such as pneumonia and lung opacity, demonstrating its utility in automating preliminary radiological assessments. By leveraging large-scale, domain-specific datasets and employing advanced training strategies, our VED architecture achieves state-of-the-art performance on standard VQA metrics, setting a new benchmark in diagnostic accuracy. These advancements highlight the potential of Diff-VQA to enhance clinical workflows and support radiologists in making more precise, informed decisions.
APA
Marhuenda, L., Obrador-Reina, M., Aas-Alas, M., Albiol, A. & Paredes, R. (2026). Unveiling Differences: A Vision Encoder-Decoder Model for Difference Medical Visual Question Answering. Proceedings of The 8th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 301:1095-1106. Available from https://proceedings.mlr.press/v301/marhuenda26a.html.
