Test-Time Scaling in Clinical Decision Making

Ji Young Byun, Young-Jin Park, Navid Azizan, Rama Chellappa
Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, PMLR 315:2118-2142, 2026.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and knowledge-intensive tasks, yet their potential for clinical decision making through test-time scaling (TTS) remains largely unexplored. While TTS has shown promise in improving reasoning performance by leveraging additional inference-time computation, its effectiveness in the medical domain has not been systematically investigated. This gap is further exacerbated by the impracticality of supervised fine-tuning for clinical reasoning tasks, owing to limited data availability and high annotation costs. In this work, we present a comprehensive study of TTS for clinical decision making. We systematically investigate the interaction between TTS and inference strategies, including direct answering, chain-of-thought prompting, and two-stage reasoning. We generate multiple candidate outputs in parallel using large reasoning models and aggregate them via self-consistency decoding. This approach requires no supervision while leveraging additional inference-time computation to improve performance. We provide a comprehensive empirical evaluation across both text-based medical question answering benchmarks and medical imaging modalities, demonstrating consistent improvements over single-pass inference baselines with performance gains of up to 30 percentage points. Finally, we provide an analytical characterization of TTS, deriving scaling laws that describe how performance improves with the number of samples and identifying conditions under which TTS yields reliable gains, along with empirical validation on diverse medical decision-making tasks.
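The self-consistency decoding described in the abstract amounts to sampling several candidate answers independently and returning the majority vote. A minimal sketch of that aggregation step is below; the `sample_fn` stub and its 70% per-sample accuracy are illustrative assumptions for demonstration, not the paper's models or numbers.

```python
import random
from collections import Counter

def self_consistency(sample_fn, n_samples=8):
    """Aggregate n_samples independently drawn answers by majority vote.

    sample_fn: callable returning one final answer string per call
    (a hypothetical stand-in for one stochastic LLM inference pass).
    Returns the winning answer and the fraction of samples that agreed.
    """
    answers = [sample_fn() for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples

# Toy demo: a "model" that answers a 4-option question correctly ("B")
# with probability 0.7 and otherwise picks a wrong option at random.
random.seed(0)

def noisy_model():
    return "B" if random.random() < 0.7 else random.choice(["A", "C", "D"])

ans, agreement = self_consistency(noisy_model, n_samples=25)
```

Because wrong answers scatter across the remaining options while correct ones concentrate on a single choice, the voted accuracy improves rapidly with the number of samples whenever the per-sample correct probability exceeds that of any single wrong answer; this is the regime in which TTS of this kind yields reliable gains.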

Cite this Paper


BibTeX
@InProceedings{pmlr-v315-byun26a,
  title     = {Test-Time Scaling in Clinical Decision Making},
  author    = {Byun, Ji Young and Park, Young-Jin and Azizan, Navid and Chellappa, Rama},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  pages     = {2118--2142},
  year      = {2026},
  editor    = {Huo, Yuankai and Gao, Mingchen and Kuo, Chang-Fu and Jin, Yueming and Deng, Ruining},
  volume    = {315},
  series    = {Proceedings of Machine Learning Research},
  month     = {08--10 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v315/main/assets/byun26a/byun26a.pdf},
  url       = {https://proceedings.mlr.press/v315/byun26a.html},
  abstract  = {Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and knowledge-intensive tasks, yet their potential for clinical decision making through test-time scaling (TTS) remains largely unexplored. While TTS has shown promise in improving reasoning performance by leveraging additional inference-time computation, its effectiveness in the medical domain has not been systematically investigated. This gap is further exacerbated by the impracticality of supervised fine-tuning for clinical reasoning tasks, owing to limited data availability and high annotation costs. In this work, we present a comprehensive study of TTS for clinical decision making. We systematically investigate the interaction between TTS and inference strategies, including direct answering, chain-of-thought prompting, and two-stage reasoning. We generate multiple candidate outputs in parallel using large reasoning models and aggregate them via self-consistency decoding. This approach requires no supervision while leveraging additional inference-time computation to improve performance. We provide a comprehensive empirical evaluation across both text-based medical question answering benchmarks and medical imaging modalities, demonstrating consistent improvements over single-pass inference baselines with performance gains of up to 30 percentage points. Finally, we provide an analytical characterization of TTS, deriving scaling laws that describe how performance improves with the number of samples and identifying conditions under which TTS yields reliable gains, along with empirical validation on diverse medical decision-making tasks.}
}
Endnote
%0 Conference Paper
%T Test-Time Scaling in Clinical Decision Making
%A Ji Young Byun
%A Young-Jin Park
%A Navid Azizan
%A Rama Chellappa
%B Proceedings of The 9th International Conference on Medical Imaging with Deep Learning
%C Proceedings of Machine Learning Research
%D 2026
%E Yuankai Huo
%E Mingchen Gao
%E Chang-Fu Kuo
%E Yueming Jin
%E Ruining Deng
%F pmlr-v315-byun26a
%I PMLR
%P 2118--2142
%U https://proceedings.mlr.press/v315/byun26a.html
%V 315
%X Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and knowledge-intensive tasks, yet their potential for clinical decision making through test-time scaling (TTS) remains largely unexplored. While TTS has shown promise in improving reasoning performance by leveraging additional inference-time computation, its effectiveness in the medical domain has not been systematically investigated. This gap is further exacerbated by the impracticality of supervised fine-tuning for clinical reasoning tasks, owing to limited data availability and high annotation costs. In this work, we present a comprehensive study of TTS for clinical decision making. We systematically investigate the interaction between TTS and inference strategies, including direct answering, chain-of-thought prompting, and two-stage reasoning. We generate multiple candidate outputs in parallel using large reasoning models and aggregate them via self-consistency decoding. This approach requires no supervision while leveraging additional inference-time computation to improve performance. We provide a comprehensive empirical evaluation across both text-based medical question answering benchmarks and medical imaging modalities, demonstrating consistent improvements over single-pass inference baselines with performance gains of up to 30 percentage points. Finally, we provide an analytical characterization of TTS, deriving scaling laws that describe how performance improves with the number of samples and identifying conditions under which TTS yields reliable gains, along with empirical validation on diverse medical decision-making tasks.
APA
Byun, J.Y., Park, Y.J., Azizan, N. & Chellappa, R. (2026). Test-Time Scaling in Clinical Decision Making. Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 315:2118-2142. Available from https://proceedings.mlr.press/v315/byun26a.html.