On the Robustness of Reward Models for Language Model Alignment
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:23682-23699, 2025.
Abstract
The Bradley-Terry (BT) model is widely used for reward modeling in reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with the BT loss as one-way classifiers are prone to over-optimization and lose generalizability to unseen inputs. In this paper, we study the cause of over-optimization and its downstream effects on the RLHF procedure, highlighting the importance of robustness in RMs. First, we show that excessive dispersion of hidden-state norms is the main source of over-optimization. Correspondingly, we propose batch-wise sum-to-zero regularization (BSR), which constrains the sum of rewards in each batch to be zero, suppressing rewards with abnormally large magnitudes. We assess the impact of BSR on RM robustness across four over-optimization scenarios, in which BSR consistently exhibits better robustness on unseen inputs. We then compare the plain BT model and BSR in RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. In RLOO training with the 8B RMs, we reduce generation length on AlpacaEval 2.0 by 40% while adding a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training.
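To illustrate how a batch-wise sum-to-zero term can be attached to the standard BT objective, the sketch below adds a penalty on the squared mean of all rewards in a batch to the pairwise logistic loss. This is a minimal sketch under assumptions: the exact penalty form, the weight lambda_bsr, and the function name bt_loss_with_bsr are illustrative choices, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def bt_loss_with_bsr(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         lambda_bsr: float = 0.01) -> torch.Tensor:
        """Bradley-Terry pairwise loss plus a batch-wise sum-to-zero penalty.

        r_chosen, r_rejected: scalar rewards for the preferred / dispreferred
        responses in one batch, each of shape (batch_size,).
        lambda_bsr and the squared-mean penalty form are illustrative assumptions.
        """
        # Standard BT objective: -log sigmoid(r_w - r_l), averaged over the batch.
        bt = -F.logsigmoid(r_chosen - r_rejected).mean()

        # BSR-style penalty: push the rewards in the batch to be zero-centered,
        # discouraging rewards with abnormally large magnitudes.
        all_rewards = torch.cat([r_chosen, r_rejected])
        bsr = all_rewards.mean() ** 2

        return bt + lambda_bsr * bsr

    # Example usage with random values standing in for reward-model outputs.
    if __name__ == "__main__":
        r_w = torch.randn(8, requires_grad=True)
        r_l = torch.randn(8, requires_grad=True)
        loss = bt_loss_with_bsr(r_w, r_l)
        loss.backward()
        print(loss.item())

In this sketch the penalty is applied to the mean rather than the raw sum so that its scale does not grow with batch size; whether the paper normalizes in this way is an assumption here.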