MedVLThinker: Simple Baselines for Multimodal Medical Reasoning
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:384-398, 2026.
Abstract
Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counterintuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state of the art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
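To make the "verifiable reward" notion concrete, the sketch below shows one way a final-answer-correctness reward could be computed for multiple-choice medical QA. It is a minimal illustration only: the answer-extraction pattern, function name, and assumption that the model ends its output with "Answer: <letter>" are not taken from the released code.

```python
import re


def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Binary reward for RLVR on multiple-choice QA (illustrative sketch).

    Assumes the model is prompted to finish with "Answer: <letter>";
    the regex and this helper are assumptions, not the authors' recipe.
    """
    match = re.search(r"Answer:\s*([A-E])", model_output, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1).upper() == gold_answer.strip().upper() else 0.0


# Example usage
print(verifiable_reward("...reasoning trace... Answer: C", "C"))  # 1.0
print(verifiable_reward("...reasoning trace... Answer: B", "C"))  # 0.0
```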