Improving Rationality in the Reasoning Process of Language Models through Self-playing Game

Pinzheng Wang, Juntao Li, Zecheng Tang, Haijia Gui, Min Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:63417-63438, 2025.

Abstract

Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame} (\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-wang25bb,
  title     = {Improving Rationality in the Reasoning Process of Language Models through Self-playing Game},
  author    = {Wang, Pinzheng and Li, Juntao and Tang, Zecheng and Gui, Haijia and Zhang, Min},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {63417--63438},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/wang25bb/wang25bb.pdf},
  url       = {https://proceedings.mlr.press/v267/wang25bb.html},
  abstract  = {Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame} (\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.}
}
Endnote
%0 Conference Paper
%T Improving Rationality in the Reasoning Process of Language Models through Self-playing Game
%A Pinzheng Wang
%A Juntao Li
%A Zecheng Tang
%A Haijia Gui
%A Min Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-wang25bb
%I PMLR
%P 63417--63438
%U https://proceedings.mlr.press/v267/wang25bb.html
%V 267
%X Large language models (LLMs) have demonstrated considerable reasoning abilities in various tasks such as mathematics and coding. However, recent studies indicate that even the best models lack true comprehension of their reasoning processes. In this paper, we explore how self-play can enhance the rationality of models in the reasoning process without supervision from humans or superior models. We design a $\textit{\textbf{C}ritic-\textbf{D}iscernment \textbf{G}ame} (\textbf{CDG})$ in which a prover first provides a solution to a given problem and is subsequently challenged by critiques of its solution. These critiques either aim to assist or mislead the prover. The objective of the prover is to maintain the correct answer when faced with misleading comments, while correcting errors in response to constructive feedback. Our experiments on tasks involving mathematical reasoning, stepwise error detection, self-correction, and long-chain reasoning demonstrate that CDG training can significantly improve the ability of well-aligned LLMs to comprehend their reasoning process.
APA
Wang, P., Li, J., Tang, Z., Gui, H., & Zhang, M. (2025). Improving Rationality in the Reasoning Process of Language Models through Self-playing Game. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:63417-63438. Available from https://proceedings.mlr.press/v267/wang25bb.html.