Language Models as Implicit Tree Search

Ziliang Chen; Zhao-Rong Lai; Yufeng Yang; Liangda Fang; Zhanfu Yang; Liang Lin

Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:8364-8385, 2025.

Abstract

Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning with the free lunch from reinforcement learning (RL). To address this, this work proposes a new RL-free preference optimization method that aims to achieve DPO while learning another LM, whose response generation policy is asymptotically equivalent to AlphaZero-like search, the apex of algorithms for complex reasoning tasks such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still seeks to equip DPO with a reasoning procedure technically akin to AlphaZero. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v267-chen25ag,
  title = {Language Models as Implicit Tree Search},
  author = {Chen, Ziliang and Lai, Zhao-Rong and Yang, Yufeng and Fang, Liangda and Yang, Zhanfu and Lin, Liang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages = {8364--8385},
  year = {2025},
  editor = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume = {267},
  series = {Proceedings of Machine Learning Research},
  month = {13--19 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/chen25ag/chen25ag.pdf},
  url = {https://proceedings.mlr.press/v267/chen25ag.html},
  abstract = {Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning with the free lunch from reinforcement learning (RL). To address this, this work proposes a new RL-free preference optimization method that aims to achieve DPO while learning another LM, whose response generation policy is asymptotically equivalent to AlphaZero-like search, the apex of algorithms for complex reasoning tasks such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still seeks to equip DPO with a reasoning procedure technically akin to AlphaZero. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.}
}

Endnote

%0 Conference Paper
%T Language Models as Implicit Tree Search
%A Ziliang Chen
%A Zhao-Rong Lai
%A Yufeng Yang
%A Liangda Fang
%A Zhanfu Yang
%A Liang Lin
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-chen25ag
%I PMLR
%P 8364--8385
%U https://proceedings.mlr.press/v267/chen25ag.html
%V 267
%X Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning with the free lunch from reinforcement learning (RL). To address this, this work proposes a new RL-free preference optimization method that aims to achieve DPO while learning another LM, whose response generation policy is asymptotically equivalent to AlphaZero-like search, the apex of algorithms for complex reasoning tasks such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still seeks to equip DPO with a reasoning procedure technically akin to AlphaZero. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.

APA

Chen, Z., Lai, Z.-R., Yang, Y., Fang, L., Yang, Z., & Lin, L. (2025). Language Models as Implicit Tree Search. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:8364-8385. Available from https://proceedings.mlr.press/v267/chen25ag.html.
