RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering

Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:3036-3068, 2026.

Abstract

In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. While existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and prone to text-based shortcuts, RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M QA samples. It covers three key tasks—abnormality detection, anatomy recognition, and pathology identification—spanning 8 anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, especially in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model accuracies collapse to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts.

Cite this Paper
BibTeX
@InProceedings{pmlr-v315-butsanets26a,
  title     = {RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering},
  author    = {Butsanets, L{\'e}o and Corbi{\`e}re, Charles and Khlaut, Julien and Manceron, Pierre and Dancette, Corentin},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {3036--3068},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/butsanets26a/butsanets26a.pdf},
  url       = {https://proceedings.mlr.press/v315/butsanets26a.html},
  abstract  = {In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. While existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and prone to text-based shortcuts, RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M QA samples. It covers three key tasks—abnormality detection, anatomy recognition, and pathology identification—spanning 8 anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, especially in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model accuracies collapse to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts.}
}
Endnote
%0 Conference Paper
%T RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
%A Léo Butsanets
%A Charles Corbière
%A Julien Khlaut
%A Pierre Manceron
%A Corentin Dancette
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-butsanets26a
%I PMLR
%P 3036--3068
%U https://proceedings.mlr.press/v315/butsanets26a.html
%V 315
%X In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. While existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and prone to text-based shortcuts, RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M QA samples. It covers three key tasks—abnormality detection, anatomy recognition, and pathology identification—spanning 8 anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, especially in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model accuracies collapse to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts.
APA
Butsanets, L., Corbière, C., Khlaut, J., Manceron, P. & Dancette, C. (2026). RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:3036-3068. Available from https://proceedings.mlr.press/v315/butsanets26a.html.