Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54058-54094, 2025.

Abstract

Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-setlur25a, title = {Scaling Test-Time Compute Without Verification or {RL} is Suboptimal}, author = {Setlur, Amrith and Rajaraman, Nived and Levine, Sergey and Kumar, Aviral}, booktitle = {Proceedings of the 42nd International Conference on Machine Learning}, pages = {54058--54094}, year = {2025}, editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry}, volume = {267}, series = {Proceedings of Machine Learning Research}, month = {13--19 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/setlur25a/setlur25a.pdf}, url = {https://proceedings.mlr.press/v267/setlur25a.html}, abstract = {Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.} }
Endnote
%0 Conference Paper %T Scaling Test-Time Compute Without Verification or RL is Suboptimal %A Amrith Setlur %A Nived Rajaraman %A Sergey Levine %A Aviral Kumar %B Proceedings of the 42nd International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2025 %E Aarti Singh %E Maryam Fazel %E Daniel Hsu %E Simon Lacoste-Julien %E Felix Berkenkamp %E Tegan Maharaj %E Kiri Wagstaff %E Jerry Zhu %F pmlr-v267-setlur25a %I PMLR %P 54058--54094 %U https://proceedings.mlr.press/v267/setlur25a.html %V 267 %X Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
APA
Setlur, A., Rajaraman, N., Levine, S. & Kumar, A.. (2025). Scaling Test-Time Compute Without Verification or RL is Suboptimal. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54058-54094 Available from https://proceedings.mlr.press/v267/setlur25a.html.

Related Material