VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:74197-74239, 2025.

Abstract

Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
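The abstract's headline comparison relies on weighted majority voting: the LLM samples many chains of thought per question, a PRM scores each reasoning step, and votes for each final answer are weighted by an aggregate of those step scores rather than counted equally. A minimal sketch of the idea follows; this is not the paper's implementation, the function name is illustrative, and min-aggregation of step scores is one common convention assumed here.

    from collections import defaultdict

    def weighted_majority_vote(candidates):
        """Pick the answer whose sampled solutions carry the most total PRM reward.

        `candidates` is a list of (answer, step_scores) pairs, where
        `step_scores` are per-step correctness probabilities from a PRM.
        Each chain of thought is weighted by its weakest step (min-aggregation,
        an assumed convention); plain majority voting is the special case
        where every weight is 1.
        """
        totals = defaultdict(float)
        for answer, step_scores in candidates:
            totals[answer] += min(step_scores)  # weight = aggregated PRM score
        return max(totals, key=totals.get)

    # Toy usage: three sampled solutions to the same multiple-choice question.
    samples = [
        ("B", [0.9, 0.8, 0.95]),  # strong chain ending in answer B
        ("C", [0.6, 0.2, 0.7]),   # one weak step drags this chain down
        ("C", [0.5, 0.4, 0.3]),
    ]
    print(weighted_majority_vote(samples))  # -> "B", though "C" has more raw votes

Note how a well-scored minority answer can overtake a weakly-scored majority, which is exactly the regime where a domain-appropriate PRM helps and a math-only PRM may not.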

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-zeng25h,
  title     = {{V}ersa{PRM}: Multi-Domain Process Reward Model via Synthetic Reasoning Data},
  author    = {Zeng, Thomas and Zhang, Shuibai and Wu, Shutong and Classen, Christian and Chae, Daewon and Ewer, Ethan and Lee, Minjae and Kim, Heeju and Kang, Wonjun and Kunde, Jackson and Fan, Ying and Kim, Jungtaek and Koo, Hyung Il and Ramchandran, Kannan and Papailiopoulos, Dimitris and Lee, Kangwook},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {74197--74239},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zeng25h/zeng25h.pdf},
  url       = {https://proceedings.mlr.press/v267/zeng25h.html},
  abstract  = {Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.}
}
Endnote
%0 Conference Paper
%T VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
%A Thomas Zeng
%A Shuibai Zhang
%A Shutong Wu
%A Christian Classen
%A Daewon Chae
%A Ethan Ewer
%A Minjae Lee
%A Heeju Kim
%A Wonjun Kang
%A Jackson Kunde
%A Ying Fan
%A Jungtaek Kim
%A Hyung Il Koo
%A Kannan Ramchandran
%A Dimitris Papailiopoulos
%A Kangwook Lee
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zeng25h
%I PMLR
%P 74197--74239
%U https://proceedings.mlr.press/v267/zeng25h.html
%V 267
%X Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs have poor performance in other domains. To address this limitation, we introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
APA
Zeng, T., Zhang, S., Wu, S., Classen, C., Chae, D., Ewer, E., Lee, M., Kim, H., Kang, W., Kunde, J., Fan, Y., Kim, J., Koo, H.I., Ramchandran, K., Papailiopoulos, D. & Lee, K. (2025). VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:74197-74239. Available from https://proceedings.mlr.press/v267/zeng25h.html.