RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF

Xiaomin Li, Mingye Gao, Zhiwei Zhang, Jingxuan Fan, Weiyu Li
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:34355-34378, 2025.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison of two responses, there is a growing shift towards a fine-grained annotation approach, in which responses are assessed against multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy between the responses in each pair and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.
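The selection step described in the abstract can be pictured with a minimal sketch: score both responses in a pair against a pool of candidate rules, keep the rules where the two responses disagree most, and derive the preference label from those rules alone. The Python snippet below is an illustrative sketch under those assumptions, not the paper's implementation; the function name select_rules_by_discrepancy, the dictionary score format, and the top-k sum aggregation are hypothetical choices made for the example.

from typing import Dict, List, Tuple

def select_rules_by_discrepancy(
    scores_a: Dict[str, float],   # rule name -> score for response A
    scores_b: Dict[str, float],   # rule name -> score for response B (same keys assumed)
    k: int = 5,
) -> Tuple[List[str], int]:
    """Keep the k rules where the paired responses differ most, then label
    the pair (0 = A preferred, 1 = B preferred) by aggregating those rules.
    Hypothetical sketch of discrepancy-based rule selection, not the paper's code."""
    # Rank rules by absolute score discrepancy between the paired responses.
    ranked = sorted(scores_a, key=lambda r: abs(scores_a[r] - scores_b[r]), reverse=True)
    selected = ranked[:k]
    # Aggregate only the selected rules to produce a preference label.
    margin = sum(scores_a[r] - scores_b[r] for r in selected)
    label = 0 if margin >= 0 else 1
    return selected, label

# Example with hypothetical rule scores (e.g. from an LLM judge on a 0-1 scale):
scores_a = {"no_harmful_instructions": 0.9, "refuses_politely": 0.4, "factually_grounded": 0.7}
scores_b = {"no_harmful_instructions": 0.3, "refuses_politely": 0.5, "factually_grounded": 0.7}
rules, label = select_rules_by_discrepancy(scores_a, scores_b, k=2)
print(rules, label)  # the two most discriminative rules, and the induced preference

The pairs labeled this way would then serve as the preference data for training the safety reward model.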

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25o,
  title     = {{R}ule{A}dapter: Dynamic Rules for training Safety Reward Models in {RLHF}},
  author    = {Li, Xiaomin and Gao, Mingye and Zhang, Zhiwei and Fan, Jingxuan and Li, Weiyu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {34355--34378},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25o/li25o.pdf},
  url       = {https://proceedings.mlr.press/v267/li25o.html},
  abstract  = {Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison of two responses, there is a growing shift towards a fine-grained annotation approach, in which responses are assessed against multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy between the responses in each pair and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.}
}
Endnote
%0 Conference Paper
%T RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF
%A Xiaomin Li
%A Mingye Gao
%A Zhiwei Zhang
%A Jingxuan Fan
%A Weiyu Li
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25o
%I PMLR
%P 34355--34378
%U https://proceedings.mlr.press/v267/li25o.html
%V 267
%X Reinforcement Learning from Human Feedback (RLHF) is widely used to align models with human preferences, particularly to enhance the safety of responses generated by LLMs. This method traditionally relies on choosing preferred responses from response pairs. However, due to variations in human opinions and the difficulty of making an overall comparison of two responses, there is a growing shift towards a fine-grained annotation approach, in which responses are assessed against multiple specific metrics or rules. Selecting and applying these rules efficiently while accommodating the diversity of preference data remains a significant challenge. In this paper, we introduce a dynamic approach that adaptively selects the most critical rules for each pair of responses. We develop a mathematical framework that leverages the maximum discrepancy between the responses in each pair and theoretically show that this strategy optimizes the mutual information between the rule-based labeling and the hidden ground-truth preferences. We then train an 8B reward model using the adaptively labeled preference dataset and evaluate its performance on RewardBench. As of May 25, 2025, our model achieved the highest safety performance on the leaderboard, outperforming various larger models.
APA
Li, X., Gao, M., Zhang, Z., Fan, J. & Li, W. (2025). RuleAdapter: Dynamic Rules for training Safety Reward Models in RLHF. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:34355-34378. Available from https://proceedings.mlr.press/v267/li25o.html.