R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34509-34527, 2025.
Abstract
Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address these challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. LLMs are employed as the mutation operator, and module-level crossover is proposed to facilitate efficient exploration and exploitation. To optimize reward parameters efficiently, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism is employed to collect trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
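The sketch below is a minimal, hypothetical illustration of the two-part loop described in the abstract: evolving reward structures via module-level crossover (with an LLM standing in as the mutation operator) and aligning reward parameters via preference learning over segment pairs whose labels survive a majority vote among several critic functions. All names here (`RewardFunction`, `llm_mutate_module`, `vote_on_segment_pair`, `align_parameters`) are placeholders invented for illustration and are not the paper's actual code or API.

```python
# Hypothetical sketch of R*-style reward design, assuming:
#  - a reward function is a set of named code modules plus tunable weights,
#  - an external LLM call (stubbed out here) mutates individual modules,
#  - critic functions return +1/-1 preferences over trajectory segments.
import random
from dataclasses import dataclass


@dataclass
class RewardFunction:
    """A reward function as named modules plus scalar weights (illustrative only)."""
    modules: dict   # module name -> code string, e.g. "distance", "energy"
    weights: dict   # module name -> scalar parameter
    fitness: float = 0.0


def llm_mutate_module(code: str) -> str:
    """Placeholder for an LLM call that rewrites one reward module."""
    return code + "  # mutated"


def module_crossover(a: RewardFunction, b: RewardFunction) -> RewardFunction:
    """Module-level crossover: each child module is taken whole from one parent."""
    child_modules, child_weights = {}, {}
    for name in a.modules:
        donor = a if random.random() < 0.5 else b
        child_modules[name] = donor.modules[name]
        child_weights[name] = donor.weights[name]
    return RewardFunction(child_modules, child_weights)


def vote_on_segment_pair(critics, seg_a, seg_b, threshold=0.8):
    """Keep a preference label only if a large majority of critics agree."""
    votes = [critic(seg_a, seg_b) for critic in critics]  # each vote in {+1, -1}
    agreement = abs(sum(votes)) / len(votes)
    if agreement >= threshold:
        return 1 if sum(votes) > 0 else -1                 # high-confidence label
    return None                                            # discard ambiguous pair


def align_parameters(reward_fn, labeled_pairs, lr=0.01, epochs=10):
    """Toy preference-learning step: nudge module weights so the preferred
    segment scores higher under the parameterized reward."""
    for _ in range(epochs):
        for (feat_a, feat_b), label in labeled_pairs:
            # feat_* : dict of per-module feature sums for each segment
            margin = sum(reward_fn.weights[m] * (feat_a[m] - feat_b[m])
                         for m in reward_fn.weights)
            if margin * label <= 0:                        # pair is misordered
                for m in reward_fn.weights:
                    reward_fn.weights[m] += lr * label * (feat_a[m] - feat_b[m])
    return reward_fn
```

In this reading, the evolutionary loop alternates mutation (`llm_mutate_module`) and `module_crossover` over a population, while `vote_on_segment_pair` filters critic annotations into high-confidence preference pairs that `align_parameters` uses to tune the reward weights; the paper's actual operators, critic prompts, and preference-learning objective may differ.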