Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs

Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V Le, Qijun Tan, Yuan Liu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:71910-71937, 2025.

Abstract

Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets: prompts are curated a priori and sampled on a fixed schedule during training, regardless of their usefulness to the RL process. We design eva, the first method that allows LLMs to prioritize and adaptively create useful prompts during RL training, guided by reward signals. In principle, eva (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to those prompts. eva is simple, suits both offline and online RL for LLMs, and sets a new state of the art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it’s win rate on Arena-Hard from 51.6% to 60.1% with DPO and from 52.6% to 62.4% with RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both of which are orders of magnitude larger. Further ablation studies show that eva can induce a meaningful learning curriculum and effectively scale RL for LLMs beyond static human prompts.
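To make the creator/solver division concrete, below is a minimal, illustrative sketch of one reward-guided prompt-evolution round. It is not the authors' implementation: the callables (solver_sample, reward_fn, creator_evolve, preference_update) are hypothetical stand-ins, and the reward gap between the best and worst sampled responses is only an assumed proxy for prompt usefulness.

# Illustrative creator/solver round in the spirit of eva's asymmetric self-play.
# NOTE: all callables below are hypothetical stand-ins, not the authors' API,
# and the reward-gap prioritization is an assumed proxy for prompt usefulness.
from typing import Callable, List, Tuple

def eva_round(
    prompt_pool: List[str],
    solver_sample: Callable[[str, int], List[str]],    # prompt -> k candidate responses
    reward_fn: Callable[[str, str], float],            # (prompt, response) -> scalar reward
    creator_evolve: Callable[[str], str],              # informative prompt -> new prompt variant
    preference_update: Callable[[List[Tuple[str, str, str]]], None],  # (prompt, chosen, rejected)
    k: int = 4,
    top_frac: float = 0.25,
) -> List[str]:
    """One round: score prompts by a reward-based signal, evolve new prompts
    from the most informative ones, and update the solver on preference pairs."""
    scored: List[Tuple[float, str]] = []
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompt_pool:
        responses = solver_sample(prompt, k)
        rewards = [reward_fn(prompt, r) for r in responses]
        best = max(range(len(responses)), key=lambda i: rewards[i])
        worst = min(range(len(responses)), key=lambda i: rewards[i])
        # Assumed usefulness proxy: reward gap among the solver's own samples.
        scored.append((rewards[best] - rewards[worst], prompt))
        pairs.append((prompt, responses[best], responses[worst]))

    # Creator step: evolve variants of the highest-gap prompts and grow the pool.
    scored.sort(key=lambda t: t[0], reverse=True)
    n_top = max(1, int(top_frac * len(scored)))
    new_prompts = [creator_evolve(p) for _, p in scored[:n_top]]

    # Solver step: preference optimization on the pairs (e.g., a DPO- or RLOO-style update).
    preference_update(pairs)
    return prompt_pool + new_prompts

The reward-gap heuristic is just one plausible way to let reward signals decide which prompts are worth evolving; the paper's actual selection and evolution criteria may differ.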

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-ye25a,
  title     = {Reward-Guided Prompt Evolving in Reinforcement Learning for {LLM}s},
  author    = {Ye, Ziyu and Agarwal, Rishabh and Liu, Tianqi and Joshi, Rishabh and Velury, Sarmishta and Le, Quoc V and Tan, Qijun and Liu, Yuan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {71910--71937},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/ye25a/ye25a.pdf},
  url       = {https://proceedings.mlr.press/v267/ye25a.html},
  abstract  = {Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets, where prompts are curated a priori, and sampled in a fixed schedule for training, regardless of their usefulness to the RL process. We design eva, the first method that allows LLMs to prioritize and adaptively create useful prompts during RL training by reward signals. In principle, eva (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to the prompts. eva is simple, suits both offline and online RL for LLMs, and sets a new state-of-the-art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it’s win-rate on Arena-Hard from 51.6% to 60.1% by DPO and 52.6% to 62.4% by RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both are orders of magnitude larger. Further ablation studies show eva can induce meaningful learning curriculum, and effectively scale RL for LLMs beyond static human prompts.}
}
Endnote
%0 Conference Paper
%T Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs
%A Ziyu Ye
%A Rishabh Agarwal
%A Tianqi Liu
%A Rishabh Joshi
%A Sarmishta Velury
%A Quoc V Le
%A Qijun Tan
%A Yuan Liu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-ye25a
%I PMLR
%P 71910--71937
%U https://proceedings.mlr.press/v267/ye25a.html
%V 267
%X Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets, where prompts are curated a priori, and sampled in a fixed schedule for training, regardless of their usefulness to the RL process. We design eva, the first method that allows LLMs to prioritize and adaptively create useful prompts during RL training by reward signals. In principle, eva (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to the prompts. eva is simple, suits both offline and online RL for LLMs, and sets a new state-of-the-art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it’s win-rate on Arena-Hard from 51.6% to 60.1% by DPO and 52.6% to 62.4% by RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both are orders of magnitude larger. Further ablation studies show eva can induce meaningful learning curriculum, and effectively scale RL for LLMs beyond static human prompts.
APA
Ye, Z., Agarwal, R., Liu, T., Joshi, R., Velury, S., Le, Q.V., Tan, Q. & Liu, Y. (2025). Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:71910-71937. Available from https://proceedings.mlr.press/v267/ye25a.html.