R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34509-34527, 2025.
Abstract
Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address these challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. LLMs are employed as the mutation operator, and module-level crossover is proposed to facilitate efficient exploration and exploitation. To optimize reward parameters efficiently, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism is employed to collect trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
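The sketch below is a minimal, hypothetical illustration of the two-part loop described in the abstract: evolving reward structures via module-level crossover (with an LLM standing in as the mutation operator) and aligning reward parameters via preference learning over segment pairs whose labels survive a majority vote among several critic functions. All names here (`RewardFunction`, `llm_mutate_module`, `vote_on_segment_pair`, `align_parameters`) are placeholders invented for illustration and are not the paper's actual code or API.

```python
# Hypothetical sketch of R*-style reward design, assuming:
#  - a reward function is a set of named code modules plus tunable weights,
#  - an external LLM call (stubbed out here) mutates individual modules,
#  - critic functions return +1/-1 preferences over trajectory segments.
import random
from dataclasses import dataclass


@dataclass
class RewardFunction:
    """A reward function as named modules plus scalar weights (illustrative only)."""
    modules: dict   # module name -> code string, e.g. "distance", "energy"
    weights: dict   # module name -> scalar parameter
    fitness: float = 0.0


def llm_mutate_module(code: str) -> str:
    """Placeholder for an LLM call that rewrites one reward module."""
    return code + "  # mutated"


def module_crossover(a: RewardFunction, b: RewardFunction) -> RewardFunction:
    """Module-level crossover: each child module is taken whole from one parent."""
    child_modules, child_weights = {}, {}
    for name in a.modules:
        donor = a if random.random() < 0.5 else b
        child_modules[name] = donor.modules[name]
        child_weights[name] = donor.weights[name]
    return RewardFunction(child_modules, child_weights)


def vote_on_segment_pair(critics, seg_a, seg_b, threshold=0.8):
    """Keep a preference label only if a large majority of critics agree."""
    votes = [critic(seg_a, seg_b) for critic in critics]  # each vote in {+1, -1}
    agreement = abs(sum(votes)) / len(votes)
    if agreement >= threshold:
        return 1 if sum(votes) > 0 else -1                 # high-confidence label
    return None                                            # discard ambiguous pair


def align_parameters(reward_fn, labeled_pairs, lr=0.01, epochs=10):
    """Toy preference-learning step: nudge module weights so the preferred
    segment scores higher under the parameterized reward."""
    for _ in range(epochs):
        for (feat_a, feat_b), label in labeled_pairs:
            # feat_* : dict of per-module feature sums for each segment
            margin = sum(reward_fn.weights[m] * (feat_a[m] - feat_b[m])
                         for m in reward_fn.weights)
            if margin * label <= 0:                        # pair is misordered
                for m in reward_fn.weights:
                    reward_fn.weights[m] += lr * label * (feat_a[m] - feat_b[m])
    return reward_fn
```

In this reading, the evolutionary loop alternates mutation (`llm_mutate_module`) and `module_crossover` over a population, while `vote_on_segment_pair` filters critic annotations into high-confidence preference pairs that `align_parameters` uses to tune the reward weights; the paper's actual operators, critic prompts, and preference-learning objective may differ.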