Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

Zeyu Jia, Alexander Rakhlin, Tengyang Xie
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:27373-27398, 2025.

Abstract

Process and outcome supervision represent two fundamental approaches to reinforcement learning, especially for complex reasoning tasks in large language models. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we provide a possible theoretical resolution to this debate. Perhaps surprisingly, our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision. At the core of this result lies the novel Change of Trajectory Measure Lemma—a powerful technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy’s advantage function can serve as an optimal process reward model, providing a simple yet powerful connection between outcome and process supervision. These findings suggest that the empirically observed performance gap between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data and algorithm design for reinforcement learning.
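
To make the advantage-as-process-reward connection concrete, the following is a minimal sketch in standard finite-horizon MDP notation (an illustration, not the paper's exact theorem statement), where V^pi, Q^pi, and A^pi denote the usual value, action-value, and advantage functions of a policy pi, r_h the per-step reward, and R(tau) the trajectory-level (outcome) return:

% A minimal sketch, assuming standard finite-horizon MDP notation; not the paper's exact construction.
\[
  A^{\pi}(s_h, a_h) \;=\; Q^{\pi}(s_h, a_h) - V^{\pi}(s_h),
  \qquad
  r^{\mathrm{proc}}_h(s_h, a_h) \;:=\; A^{\pi}(s_h, a_h).
\]
% With deterministic transitions and rewards (and in expectation otherwise), the step-level
% rewards telescope back to the outcome-level return, using Q^pi(s_h,a_h) = r_h + V^pi(s_{h+1}):
\[
  \sum_{h=1}^{H} r^{\mathrm{proc}}_h(s_h, a_h)
  \;=\; \sum_{h=1}^{H} \bigl( r_h + V^{\pi}(s_{h+1}) - V^{\pi}(s_h) \bigr)
  \;=\; R(\tau) - V^{\pi}(s_1),
  \qquad V^{\pi}(s_{H+1}) := 0.
\]

Under this sketch, maximizing the summed step-level rewards differs from maximizing the outcome return only by the constant V^pi(s_1), which gives one intuition for why a policy's advantage function can act as a process reward model.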

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-jia25f,
  title     = {Do We Need to Verify Step by Step? {R}ethinking Process Supervision from a Theoretical Perspective},
  author    = {Jia, Zeyu and Rakhlin, Alexander and Xie, Tengyang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {27373--27398},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/jia25f/jia25f.pdf},
  url       = {https://proceedings.mlr.press/v267/jia25f.html}
}
APA
Jia, Z., Rakhlin, A. & Xie, T. (2025). Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:27373-27398. Available from https://proceedings.mlr.press/v267/jia25f.html.
