BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning

Han Zhong, Yutong Yin, Shenao Zhang, Xiaojun Xu, Yuanxin Liu, Yifei Zuo, Zhihan Liu, Boyi Liu, Sirui Zheng, Hongyi Guo, Liwei Wang, Mingyi Hong, Zhaoran Wang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:78555-78574, 2025.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Our framework addresses two critical questions: (1) how to generate high-quality reasoning processes during inference automatically, and (2) how to integrate these processes into post-training. We propose the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm and demonstrate its theoretical convergence at a rate of $1/T$, where $T$ is the number of iterations. The algorithm operates in two steps. First, it generates high-quality rationales by approximating the desired posterior distribution using a reinforcement learning approach with a novel reward shaping mechanism. Second, it fine-tunes the base LLM by maximizing the joint probability of rationale generation with respect to LLM parameters. Empirical evaluation on GSM8K and MATH benchmarks demonstrates that our approach consistently improves performance across different model sizes without requiring human-annotated thinking processes, outperforming standard chain-of-thought prompting while enhancing existing post-training methods.
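The two-step alternation the abstract describes (posterior approximation over latent rationales via a KL-regularized RL step, then refitting the generator on the joint likelihood) can be sketched as a toy EM-style loop. The sketch below is a hypothetical illustration, not the paper's implementation: the latent rationales, the answer likelihoods, and the `beta` temperature are all invented for the example, and the categorical tables stand in for an LLM so both steps have closed forms.

```python
import math

# Toy sketch (assumed, not the paper's code) of the BRiTE-style alternation:
#   E-step: approximate the posterior over latent rationales z by reweighting
#           the current policy with an RL-style exponentiated reward,
#           q(z) ∝ policy(z) * exp(reward(z) / beta),
#           where the reward is the log-likelihood of the correct answer.
#   M-step: refit the rationale generator by maximizing the expected joint
#           likelihood under q (for a categorical table, the maximizer is q).

RATIONALES = ["good", "bad"]          # latent thinking processes z (toy)
P_ANSWER = {"good": 0.9, "bad": 0.2}  # assumed p(correct answer | z)

def e_step(policy, beta=1.0):
    """Posterior approximation: q(z) ∝ policy(z) * exp(reward(z)/beta)."""
    weights = {z: policy[z] * math.exp(math.log(P_ANSWER[z]) / beta)
               for z in RATIONALES}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}

def m_step(q):
    """Joint-likelihood maximizer for a categorical generator is q itself."""
    return dict(q)

def brite(T=20):
    policy = {z: 1.0 / len(RATIONALES) for z in RATIONALES}  # uniform start
    for _ in range(T):
        policy = m_step(e_step(policy))
    return policy

final = brite()
```

In this toy setting the loop concentrates mass on the rationale most predictive of the correct answer; `beta` plays the role of the reward-shaping temperature, trading off reward against staying close to the current policy.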

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-zhong25e,
  title     = {{BR}i{TE}: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning},
  author    = {Zhong, Han and Yin, Yutong and Zhang, Shenao and Xu, Xiaojun and Liu, Yuanxin and Zuo, Yifei and Liu, Zhihan and Liu, Boyi and Zheng, Sirui and Guo, Hongyi and Wang, Liwei and Hong, Mingyi and Wang, Zhaoran},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {78555--78574},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhong25e/zhong25e.pdf},
  url       = {https://proceedings.mlr.press/v267/zhong25e.html},
  abstract  = {Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Our framework addresses two critical questions: (1) how to generate high-quality reasoning processes during inference automatically, and (2) how to integrate these processes into post-training. We propose the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm and demonstrate its theoretical convergence at a rate of $1/T$, where $T$ is the number of iterations. The algorithm operates in two steps. First, it generates high-quality rationales by approximating the desired posterior distribution using a reinforcement learning approach with a novel reward shaping mechanism. Second, it fine-tunes the base LLM by maximizing the joint probability of rationale generation with respect to LLM parameters. Empirical evaluation on GSM8K and MATH benchmarks demonstrates that our approach consistently improves performance across different model sizes without requiring human-annotated thinking processes, outperforming standard chain-of-thought prompting while enhancing existing post-training methods.}
}
Endnote
%0 Conference Paper
%T BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
%A Han Zhong
%A Yutong Yin
%A Shenao Zhang
%A Xiaojun Xu
%A Yuanxin Liu
%A Yifei Zuo
%A Zhihan Liu
%A Boyi Liu
%A Sirui Zheng
%A Hongyi Guo
%A Liwei Wang
%A Mingyi Hong
%A Zhaoran Wang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhong25e
%I PMLR
%P 78555--78574
%U https://proceedings.mlr.press/v267/zhong25e.html
%V 267
%X Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, yet generating reliable reasoning processes remains a significant challenge. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model incorporating latent thinking processes and evaluation signals. Our framework addresses two critical questions: (1) how to generate high-quality reasoning processes during inference automatically, and (2) how to integrate these processes into post-training. We propose the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm and demonstrate its theoretical convergence at a rate of $1/T$, where $T$ is the number of iterations. The algorithm operates in two steps. First, it generates high-quality rationales by approximating the desired posterior distribution using a reinforcement learning approach with a novel reward shaping mechanism. Second, it fine-tunes the base LLM by maximizing the joint probability of rationale generation with respect to LLM parameters. Empirical evaluation on GSM8K and MATH benchmarks demonstrates that our approach consistently improves performance across different model sizes without requiring human-annotated thinking processes, outperforming standard chain-of-thought prompting while enhancing existing post-training methods.
APA
Zhong, H., Yin, Y., Zhang, S., Xu, X., Liu, Y., Zuo, Y., Liu, Z., Liu, B., Zheng, S., Guo, H., Wang, L., Hong, M., & Wang, Z. (2025). BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:78555-78574. Available from https://proceedings.mlr.press/v267/zhong25e.html.