RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:19034-19055, 2025.

Abstract

Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
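To make the setup concrete, here is a minimal sketch of the kind of multi-turn loop the abstract describes: a model proposes code, the code is executed against public tests, error feedback is appended to the dialog before the next attempt, and a binary end-of-episode reward scores the final candidate. Every interface below (the generate_fn callable, the (stdin, expected-output) test format, the turn limit, the reward definition) is an illustrative assumption for this sketch, not the paper's actual implementation.

import subprocess
import sys
from typing import Callable, Dict, List, Tuple

def run_tests(code: str, tests: List[Tuple[str, str]], timeout: float = 2.0) -> List[str]:
    """Run candidate code on (stdin, expected stdout) pairs; return error messages."""
    errors = []
    for stdin_text, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode != 0:
                errors.append(f"runtime error: {proc.stderr.strip()[:200]}")
            elif proc.stdout.strip() != expected.strip():
                errors.append(f"wrong answer: got {proc.stdout.strip()[:80]!r}")
        except subprocess.TimeoutExpired:
            errors.append("time limit exceeded")
    return errors

def rollout(problem: str,
            generate_fn: Callable[[List[Dict[str, str]]], str],
            public_tests: List[Tuple[str, str]],
            max_turns: int = 3) -> Tuple[str, List[Dict[str, str]]]:
    """Multi-turn episode: generate code, execute on public tests, feed errors back."""
    dialog = [{"role": "user", "content": problem}]
    code = ""
    for _ in range(max_turns):
        code = generate_fn(dialog)                  # sample a candidate solution
        dialog.append({"role": "assistant", "content": code})
        errors = run_tests(code, public_tests)
        if not errors:                              # public tests pass: stop early
            break
        dialog.append({"role": "user",              # execution feedback for next turn
                       "content": "Your code failed:\n" + "\n".join(errors)})
    return code, dialog

def episode_reward(final_code: str, hidden_tests: List[Tuple[str, str]]) -> float:
    """Binary reward (an assumption): 1 if the final candidate passes held-out tests."""
    return 1.0 if not run_tests(final_code, hidden_tests) else 0.0

With a real model plugged into generate_fn, the scalar from episode_reward(final_code, hidden_tests) could serve as the return for a policy-gradient update over the whole dialog; the details of the RL algorithm and prompting used in the paper are not shown on this page.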

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-gehring25a,
  title     = {{RLEF}: Grounding Code {LLM}s in Execution Feedback with Reinforcement Learning},
  author    = {Gehring, Jonas and Zheng, Kunhao and Copet, Jade and Mella, Vegard and Cohen, Taco and Synnaeve, Gabriel},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {19034--19055},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/gehring25a/gehring25a.pdf},
  url       = {https://proceedings.mlr.press/v267/gehring25a.html},
  abstract  = {Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.}
}
Endnote
%0 Conference Paper
%T RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning
%A Jonas Gehring
%A Kunhao Zheng
%A Jade Copet
%A Vegard Mella
%A Taco Cohen
%A Gabriel Synnaeve
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-gehring25a
%I PMLR
%P 19034--19055
%U https://proceedings.mlr.press/v267/gehring25a.html
%V 267
%X Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
APA
Gehring, J., Zheng, K., Copet, J., Mella, V., Cohen, T. & Synnaeve, G. (2025). RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:19034-19055. Available from https://proceedings.mlr.press/v267/gehring25a.html.
