GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

Alexander Havrilla; Sharath Chandra Raparthy; Christoforos Nalmpantis; Jane Dwivedi-Yu; Maksym Zhuravinskyi; Eric Hambro; Roberta Raileanu

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:17719-17733, 2024.

Abstract

State-of-the-art language models can exhibit reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$ as a form of Process-based reward modeling. Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus enabling them to give precise step-level feedback to refinement models. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.
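As a rough illustration of the two selection steps the abstract describes, the sketch below locates the first suspect step from per-step scores and then reranks candidate solutions by an outcome score. It is only a minimal sketch under stated assumptions: the function names, the 0.5 threshold, and the toy scores are hypothetical and are not taken from the paper, and the trained SORM, ORM, and refinement models are replaced by plain numbers and a lookup table.

```python
from typing import Callable, Optional, Sequence


def first_error_step(step_scores: Sequence[float],
                     threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`.

    `step_scores` stands in for per-step SORM outputs; the threshold
    value is an assumption made for this illustration.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None  # no step looks wrong, so no local critique is produced


def rerank(candidates: Sequence[str],
           orm_score: Callable[[str], float]) -> str:
    """Return the candidate solution with the highest outcome score.

    `orm_score` stands in for a trained ORM; any callable that maps a
    solution string to a float works here.
    """
    return max(candidates, key=orm_score)


# Toy usage: the draft plus one global and one local refinement.
toy_scores = {"draft": 0.3, "global refinement": 0.6, "local refinement": 0.8}
best = rerank(list(toy_scores), toy_scores.__getitem__)
error_at = first_error_step([0.9, 0.8, 0.2, 0.1])
```

In this toy setup the draft competes with both refinements, so the reranker can keep the original answer when neither refinement scores higher.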

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-havrilla24a,
  title = 	 {{GL}o{R}e: When, Where, and How to Improve {LLM} Reasoning via Global and Local Refinements},
  author =       {Havrilla, Alexander and Raparthy, Sharath Chandra and Nalmpantis, Christoforos and Dwivedi-Yu, Jane and Zhuravinskyi, Maksym and Hambro, Eric and Raileanu, Roberta},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {17719--17733},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/havrilla24a/havrilla24a.pdf},
  url = 	 {https://proceedings.mlr.press/v235/havrilla24a.html},
  abstract = 	 {State-of-the-art language models can exhibit reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$ as a form of Process-based reward modeling. Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus enabling them to give precise step-level feedback to refinement models. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.}
}

Endnote

%0 Conference Paper
%T GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
%A Alexander Havrilla
%A Sharath Chandra Raparthy
%A Christoforos Nalmpantis
%A Jane Dwivedi-Yu
%A Maksym Zhuravinskyi
%A Eric Hambro
%A Roberta Raileanu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-havrilla24a
%I PMLR
%P 17719--17733
%U https://proceedings.mlr.press/v235/havrilla24a.html
%V 235
%X State-of-the-art language models can exhibit reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$ as a form of Process-based reward modeling. Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus enabling them to give precise step-level feedback to refinement models. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.

APA


Havrilla, A., Raparthy, S.C., Nalmpantis, C., Dwivedi-Yu, J., Zhuravinskyi, M., Hambro, E. & Raileanu, R. (2024). GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:17719-17733. Available from https://proceedings.mlr.press/v235/havrilla24a.html.
