Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur; Nived Rajaraman; Sergey Levine; Aviral Kumar

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54058-54094, 2025.

Abstract

Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii) using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-setlur25a,
  title = 	 {Scaling Test-Time Compute Without Verification or {RL} is Suboptimal},
  author =       {Setlur, Amrith and Rajaraman, Nived and Levine, Sergey and Kumar, Aviral},
  booktitle = 	 {Proceedings of the 42nd International Conference on Machine Learning},
  pages = 	 {54058--54094},
  year = 	 {2025},
  editor = 	 {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = 	 {267},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--19 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v267/main/assets/setlur25a/setlur25a.pdf},
  url = 	 {https://proceedings.mlr.press/v267/setlur25a.html},
  abstract = 	 {Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.}
}

Endnote

%0 Conference Paper
%T Scaling Test-Time Compute Without Verification or RL is Suboptimal
%A Amrith Setlur
%A Nived Rajaraman
%A Sergey Levine
%A Aviral Kumar
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-setlur25a
%I PMLR
%P 54058--54094
%U https://proceedings.mlr.press/v267/setlur25a.html
%V 267
%X Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows. We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.

APA

Setlur, A., Rajaraman, N., Levine, S. & Kumar, A. (2025). Scaling Test-Time Compute Without Verification or RL is Suboptimal. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54058-54094. Available from https://proceedings.mlr.press/v267/setlur25a.html.
