Test-Time Scaling in Clinical Decision Making
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2118-2142, 2026.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and knowledge-intensive tasks, yet their potential for clinical decision making through test-time scaling (TTS) remains largely unexplored. While TTS has shown promise in improving reasoning performance by leveraging additional inference-time computation, its effectiveness in the medical domain has not been systematically investigated. The gap matters all the more because supervised fine-tuning is often impractical for clinical reasoning tasks, owing to limited data availability and high annotation costs. In this work, we present a comprehensive study of TTS for clinical decision making. We systematically investigate the interaction between TTS and inference strategies, including direct answering, chain-of-thought prompting, and two-stage reasoning. We generate multiple candidate outputs in parallel using large reasoning models and aggregate them via self-consistency decoding. This approach requires no supervision and uses additional inference-time computation to improve performance. We evaluate it empirically on both text-based medical question answering benchmarks and medical imaging modalities, demonstrating consistent improvements over single-pass inference baselines, with performance gains of up to 30 percentage points. Finally, we provide an analytical characterization of TTS, deriving scaling laws that describe how performance improves with the number of samples and identifying conditions under which TTS yields reliable gains, and we validate this characterization empirically on diverse medical decision-making tasks.
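The aggregation step described in the abstract (sample several candidate outputs, then combine them via self-consistency decoding) can be illustrated with a minimal majority-vote sketch. This is not the authors' implementation: the function names `self_consistency` and `fake_sampler`, the answer normalization, and the tie-breaking rule are all assumptions made for illustration, and the sampler is a deterministic stand-in for a real model call.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> str:
    """Draw n_samples candidate answers and return the most frequent one.

    sample_answer is a hypothetical callable that returns one final answer
    (e.g. an option letter) per call; in practice it would wrap a stochastic
    model call followed by answer extraction.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    # Normalize so that "b", "B " and "B" count as the same vote (assumption).
    answers: List[str] = [
        sample_answer(question).strip().upper() for _ in range(n_samples)
    ]
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties are broken in favor of the answer sampled earliest.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a model: replays a fixed sequence of answers.
_canned_answers = cycle(["B", "A", "b", "C", "B "])


def fake_sampler(question: str) -> str:
    """Return the next canned answer, ignoring the question (demo only)."""
    return next(_canned_answers)


if __name__ == "__main__":
    print(self_consistency(fake_sampler, "Which option is correct?", n_samples=5))
```

Majority voting needs no labels or fine-tuning, which is consistent with the supervision-free setting described above; the price is `n_samples` model calls per question instead of one.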