Optimizing Test-Time Compute via Meta Reinforcement Finetuning

Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:50893-50925, 2025.

Abstract

Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of finetuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
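
To make the reward described in the abstract concrete, below is a minimal sketch of how a per-block "information gain" bonus regularizing the final 0/1 outcome reward might be computed. This is an illustrative sketch, not the authors' implementation: the function names (information_gain_bonuses, mrt_style_reward), the helper estimate_success_prob (a stand-in for any estimator of the probability that the model reaches a correct final answer given the reasoning blocks so far, e.g. by sampling best-guess answers and averaging 0/1 outcomes), and the weight alpha are all assumptions made for illustration.

# Hypothetical sketch of the dense "information gain" bonus described in the
# abstract: reward each block of the output stream by how much it improves
# the estimated probability of reaching a correct final answer.
# `estimate_success_prob` is an assumed stand-in for whatever estimator one
# uses (e.g., averaging 0/1 outcomes over sampled best-guess answers).

from typing import Callable, List, Sequence


def information_gain_bonuses(
    blocks: Sequence[str],
    estimate_success_prob: Callable[[List[str]], float],
) -> List[float]:
    """Return one dense bonus per block: the change in estimated success
    probability after appending that block to the prefix seen so far."""
    bonuses: List[float] = []
    prev = estimate_success_prob([])          # baseline: no reasoning yet
    prefix: List[str] = []
    for block in blocks:
        prefix.append(block)
        curr = estimate_success_prob(prefix)  # estimate after this block
        bonuses.append(curr - prev)           # "information gain" of the block
        prev = curr
    return bonuses


def mrt_style_reward(
    blocks: Sequence[str],
    final_correct: bool,
    estimate_success_prob: Callable[[List[str]], float],
    alpha: float = 0.1,                       # weight on the dense bonus (assumed)
) -> float:
    """0/1 outcome reward regularized by the summed per-block bonuses,
    mirroring the objective sketched in the abstract."""
    outcome = 1.0 if final_correct else 0.0
    return outcome + alpha * sum(information_gain_bonuses(blocks, estimate_success_prob))


if __name__ == "__main__":
    # Toy estimator: success probability grows with the number of blocks kept.
    toy = lambda prefix: min(1.0, 0.2 * len(prefix))
    print(mrt_style_reward(["step 1", "step 2", "step 3"],
                           final_correct=True,
                           estimate_success_prob=toy))

In this sketch, a block that does not change the estimated success probability earns no bonus, so purely redundant segments of the output stream contribute nothing beyond the final outcome reward, which is the intuition behind using such a bonus to reduce cumulative regret.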

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-qu25g,
  title     = {Optimizing Test-Time Compute via Meta Reinforcement Finetuning},
  author    = {Qu, Yuxiao and Yang, Matthew Y. R. and Setlur, Amrith and Tunstall, Lewis and Beeching, Edward Emanuel and Salakhutdinov, Ruslan and Kumar, Aviral},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {50893--50925},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/qu25g/qu25g.pdf},
  url       = {https://proceedings.mlr.press/v267/qu25g.html},
  abstract  = {Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of finetuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.}
}
Endnote
%0 Conference Paper
%T Optimizing Test-Time Compute via Meta Reinforcement Finetuning
%A Yuxiao Qu
%A Matthew Y. R. Yang
%A Amrith Setlur
%A Lewis Tunstall
%A Edward Emanuel Beeching
%A Ruslan Salakhutdinov
%A Aviral Kumar
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-qu25g
%I PMLR
%P 50893--50925
%U https://proceedings.mlr.press/v267/qu25g.html
%V 267
%X Training models to efficiently use test-time compute is crucial for improving the reasoning performance of LLMs. While current methods mostly do so via fine-tuning on search traces or running RL against the 0/1 outcome reward, do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute from the lens of exploration and exploitation. It also motivates the use of cumulative regret to measure the efficacy of test-time compute by viewing a long output stream as consisting of several episodes from the model. While current state-of-the-art models do not optimize regret, we show that regret can be minimized by running final 0/1 reward RL regularized by a dense reward bonus, given by the "information gain" from each subsequent block in the output stream. We prescribe an approach for quantifying information gain, which measures the utility of an intermediate segment of tokens towards improving accuracy of the final answer. We instantiate this idea to develop MRT, a new class of finetuning methods for optimizing test-time compute. Fine-tuning with MRT leads to substantial improvements in both performance and token efficiency on the AIME dataset.
APA
Qu, Y., Yang, M.Y.R., Setlur, A., Tunstall, L., Beeching, E.E., Salakhutdinov, R. & Kumar, A. (2025). Optimizing Test-Time Compute via Meta Reinforcement Finetuning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:50893-50925. Available from https://proceedings.mlr.press/v267/qu25g.html.
