On the Robustness of Reward Models for Language Model Alignment
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23682-23699, 2025.
Abstract
The Bradley-Terry (BT) model is widely used for reward modeling in reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT loss as one-way classifiers are prone to over-optimization and lose generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that excessive dispersion of hidden-state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR), which constrains the sum of rewards in each batch to be zero, suppressing rewards with abnormally large magnitudes. We assess the impact of BSR on RM robustness across four over-optimization scenarios, in which BSR consistently exhibits better robustness on unseen inputs. We then compare the plain BT model and BSR in RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. In RLOO training with the 8B RMs, we reduce generation length on AlpacaEval 2.0 by 40% while adding a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training.
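To illustrate how a batch-wise sum-to-zero term can be attached to the standard BT objective, the sketch below adds a penalty on the squared mean of all rewards in a batch to the pairwise logistic loss. This is a minimal sketch under assumptions: the exact penalty form, the weight lambda_bsr, and the function name bt_loss_with_bsr are illustrative choices, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def bt_loss_with_bsr(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         lambda_bsr: float = 0.01) -> torch.Tensor:
        """Bradley-Terry pairwise loss plus a batch-wise sum-to-zero penalty.

        r_chosen, r_rejected: scalar rewards for the preferred / dispreferred
        responses in one batch, each of shape (batch_size,).
        lambda_bsr and the squared-mean penalty form are illustrative assumptions.
        """
        # Standard BT objective: -log sigmoid(r_w - r_l), averaged over the batch.
        bt = -F.logsigmoid(r_chosen - r_rejected).mean()

        # BSR-style penalty: push the rewards in the batch to be zero-centered,
        # discouraging rewards with abnormally large magnitudes.
        all_rewards = torch.cat([r_chosen, r_rejected])
        bsr = all_rewards.mean() ** 2

        return bt + lambda_bsr * bsr

    # Example usage with random values standing in for reward-model outputs.
    if __name__ == "__main__":
        r_w = torch.randn(8, requires_grad=True)
        r_l = torch.randn(8, requires_grad=True)
        loss = bt_loss_with_bsr(r_w, r_l)
        loss.backward()
        print(loss.item())

In this sketch the penalty is applied to the mean rather than the raw sum so that its scale does not grow with batch size; whether the paper normalizes in this way is an assumption here.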