MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:384-398, 2026.

Abstract

Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counterintuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
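
The recipe's two technical ingredients, difficulty-based data filtering and a verifiable reward on final-answer correctness, can be sketched in a few lines of Python. The snippet below is a minimal illustration only, assuming a multiple-choice QA setup in which the model wraps its final choice in an <answer>...</answer> tag; the tag format and the function names verifiable_reward and estimate_difficulty are assumptions for exposition, not drawn from the paper's released code.

import re

def verifiable_reward(completion: str, reference: str) -> float:
    """RLVR-style reward: 1.0 iff the extracted final answer matches the key."""
    m = re.search(r"<answer>\s*([A-E])\s*</answer>", completion)
    if m is None:
        return 0.0  # unparseable outputs earn no reward
    return 1.0 if m.group(1) == reference.strip().upper() else 0.0

def estimate_difficulty(completions: list[str], reference: str) -> float:
    """Difficulty proxy: fraction of k sampled rollouts answered incorrectly.
    Items a base model always solves (0.0) or never solves (1.0) can be
    filtered out, keeping questions of intermediate difficulty."""
    correct = sum(verifiable_reward(c, reference) for c in completions)
    return 1.0 - correct / len(completions)

# Example: 3 of 4 rollouts pick the correct option B -> difficulty 0.25.
rollouts = ["<answer>B</answer>", "<answer>B</answer>",
            "<answer>C</answer>", "<answer>B</answer>"]
print(estimate_difficulty(rollouts, "B"))  # 0.25

A binary, automatically checkable reward of this kind is what makes RLVR training tractable: each rollout can be scored without a learned reward model or human grading.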

Cite this Paper

BibTeX
@InProceedings{pmlr-v297-huang26b,
  title     = {MedVLThinker: Simple Baselines for Multimodal Medical Reasoning},
  author    = {Huang, Xiaoke and Wu, Juncheng and Liu, Hui and Tang, Xianfeng and Zhou, Yuyin},
  booktitle = {Proceedings of the Fifth Machine Learning for Health Symposium},
  pages     = {384--398},
  year      = {2026},
  editor    = {Argaw, Peniel and Zhang, Haoran and Jabbour, Sarah and Chandak, Payal and Ji, Jerry and Mukherjee, Sumit and Salaudeen, Olawale and Chang, Trenton and Healey, Elizabeth and Gröger, Fabian and Adibi, Amin and Hegselmann, Stefan and Wild, Benjamin and Noori, Ayush},
  volume    = {297},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--14 Dec},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v297/main/assets/huang26b/huang26b.pdf},
  url       = {https://proceedings.mlr.press/v297/huang26b.html},
  abstract  = {Large Reasoning Models ({LRM}s) have introduced a new paradigm in {AI} by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical {LMM}s hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning ({SFT}) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards ({RLVR}) based on final answer correctness. Across extensive experiments on the Qwen2.5-{VL} model family (3B, 7B) and six medical {QA} benchmarks, we find that {RLVR} consistently and significantly outperforms {SFT}. Additionally, under the {RLVR} framework, a key, counterintuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the {RLVR} recipe on text-only data, establishes a new state-of-the-art on existing public {VQA} benchmarks, surpassing all previous open-source medical {LMM}s. Furthermore, scaling our model to 32B achieves performance on par with the proprietary {GPT}-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.}
}
Endnote
%0 Conference Paper
%T MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
%A Xiaoke Huang
%A Juncheng Wu
%A Hui Liu
%A Xianfeng Tang
%A Yuyin Zhou
%B Proceedings of the Fifth Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2026
%E Peniel Argaw
%E Haoran Zhang
%E Sarah Jabbour
%E Payal Chandak
%E Jerry Ji
%E Sumit Mukherjee
%E Olawale Salaudeen
%E Trenton Chang
%E Elizabeth Healey
%E Fabian Gröger
%E Amin Adibi
%E Stefan Hegselmann
%E Benjamin Wild
%E Ayush Noori
%F pmlr-v297-huang26b
%I PMLR
%P 384--398
%U https://proceedings.mlr.press/v297/huang26b.html
%V 297
%X Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counterintuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
APA
Huang, X., Wu, J., Liu, H., Tang, X., & Zhou, Y. (2026). MedVLThinker: Simple Baselines for Multimodal Medical Reasoning. Proceedings of the Fifth Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 297:384-398. Available from https://proceedings.mlr.press/v297/huang26b.html.