Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Zhuohao Yu, Weizheng Gu, Yidong Wang, Xingru Jiang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:72972-72994, 2025.

Abstract

Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. Traditional process supervision relies on learned reward models that require costly training data and suffer from reward misalignment, while outcome supervision fails on complex tasks that need coordinated intermediate steps. We introduce Outcome Refining Process Supervision (ORPS), which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
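
The search-and-score loop described in the abstract can be made concrete with a small sketch. The Python example below is a hypothetical, minimal illustration of the idea, not the authors' implementation: generate_candidates, self_critique, and run_tests are stand-in placeholders for the paper's LLM-driven candidate generation, self-critique scoring, and execution profiling, stubbed here so the example runs end to end.

import heapq
from typing import List, Tuple

TestCase = Tuple[tuple, object]  # (input arguments, expected output)

def run_tests(code: str, fn_name: str, tests: List[TestCase]) -> float:
    """Outcome signal: execute a candidate program and return its test pass rate."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # a real system would sandbox and profile this
        fn = namespace[fn_name]
        passed = sum(1 for args, expected in tests if fn(*args) == expected)
        return passed / len(tests)
    except Exception:
        return 0.0

def self_critique(code: str, pass_rate: float) -> float:
    """Process signal: placeholder for an LLM critique that reads runtime feedback.
    Here a toy heuristic mixes the pass rate with a brevity proxy."""
    brevity = 1.0 / (1 + code.count("\n"))
    return 0.7 * pass_rate + 0.3 * brevity

def generate_candidates(parent: str, k: int) -> List[str]:
    """Placeholder for LLM-proposed refinements of a parent solution (hypothetical)."""
    variants = [
        "def add(a, b):\n    return a - b\n",  # flawed strategy
        "def add(a, b):\n    return a + b\n",  # correct strategy
    ]
    return variants[:k]

def tree_search(tests: List[TestCase], beam: int = 2, depth: int = 2) -> str:
    """Tree-structured search: expand, execute, score, keep the top-`beam` states."""
    frontier = [""]  # start from an empty draft
    best_code, best_score = "", -1.0
    for _ in range(depth):
        scored = []
        for parent in frontier:
            for candidate in generate_candidates(parent, beam):
                outcome = run_tests(candidate, "add", tests)
                score = self_critique(candidate, outcome)  # unify process + outcome rewards
                scored.append((score, candidate))
                if score > best_score:
                    best_score, best_code = score, candidate
        frontier = [c for _, c in heapq.nlargest(beam, scored, key=lambda t: t[0])]
    return best_code

if __name__ == "__main__":
    tests = [((1, 2), 3), ((5, 5), 10)]
    print(tree_search(tests))

In the paper's actual framework, profiled execution metrics and richer LLM critiques would replace the toy heuristics above; the sketch only shows how executable verification and reasoning-based scoring can be combined in a single search loop.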

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-yu25f,
  title     = {Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation},
  author    = {Yu, Zhuohao and Gu, Weizheng and Wang, Yidong and Jiang, Xingru and Zeng, Zhengran and Wang, Jindong and Ye, Wei and Zhang, Shikun},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {72972--72994},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/yu25f/yu25f.pdf},
  url       = {https://proceedings.mlr.press/v267/yu25f.html},
  abstract  = {Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.}
}
Endnote
%0 Conference Paper
%T Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation
%A Zhuohao Yu
%A Weizheng Gu
%A Yidong Wang
%A Xingru Jiang
%A Zhengran Zeng
%A Jindong Wang
%A Wei Ye
%A Shikun Zhang
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-yu25f
%I PMLR
%P 72972--72994
%U https://proceedings.mlr.press/v267/yu25f.html
%V 267
%X Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges.
APA
Yu, Z., Gu, W., Wang, Y., Jiang, X., Zeng, Z., Wang, J., Ye, W. & Zhang, S. (2025). Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:72972-72994. Available from https://proceedings.mlr.press/v267/yu25f.html.
