T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23976-24003, 2025.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and to understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through over-sampling. We demonstrate that T1, with open LLMs as its base, exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1.
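
The over-sampling step mentioned in the abstract can be pictured with a minimal Python sketch (an illustration, not the authors' released code): for each prompt, many candidate reasoning chains are drawn at sampling temperature, scored with a rule-based reward such as exact answer matching, and the whole over-sampled batch is kept so the RL update sees a more diverse set of trajectories. The names policy_sample and extract_answer, and the factor k=16, are hypothetical placeholders for a real decoding call, an answer parser, and a tuned hyperparameter.

import random
from typing import Callable, List, Tuple

def oversample_rollouts(
    prompts: List[Tuple[str, str]],              # (question, reference_answer) pairs
    policy_sample: Callable[[str, float], str],  # placeholder: (prompt, temperature) -> response
    extract_answer: Callable[[str], str],        # placeholder: response -> final answer string
    k: int = 16,                                 # over-sampling factor per prompt (assumed value)
    temperature: float = 1.0,
) -> List[Tuple[str, str, float]]:
    """Collect (prompt, response, reward) triples for one RL update."""
    rollouts = []
    for question, reference in prompts:
        for _ in range(k):
            response = policy_sample(question, temperature)
            # Verifiable reward: 1.0 if the parsed final answer matches the reference.
            reward = 1.0 if extract_answer(response) == reference else 0.0
            rollouts.append((question, response, reward))
    random.shuffle(rollouts)  # mix prompts so gradient batches stay diverse
    return rollouts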

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-hou25e,
  title     = {T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling},
  author    = {Hou, Zhenyu and Lv, Xin and Lu, Rui and Zhang, Jiajie and Li, Yujiang and Yao, Zijun and Li, Juanzi and Tang, Jie and Dong, Yuxiao},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {23976--24003},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/hou25e/hou25e.pdf},
  url       = {https://proceedings.mlr.press/v267/hou25e.html},
  abstract  = {Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through over-sampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1.}
}
Endnote
%0 Conference Paper
%T T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
%A Zhenyu Hou
%A Xin Lv
%A Rui Lu
%A Jiajie Zhang
%A Yujiang Li
%A Zijun Yao
%A Juanzi Li
%A Jie Tang
%A Yuxiao Dong
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-hou25e
%I PMLR
%P 23976--24003
%U https://proceedings.mlr.press/v267/hou25e.html
%V 267
%X Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through over-sampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1.
APA
Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J. & Dong, Y. (2025). T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:23976-24003. Available from https://proceedings.mlr.press/v267/hou25e.html.