Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment

Yunyi Shen, Hao Sun, Jean-Francois Ton
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:54410-54430, 2025.

Abstract

Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploring the representation space and making informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying these two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose Fisher information-based selection strategies, adapting theory from the classical experimental design literature and applying it to the final linear layer of deep neural network-based reward models. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to other selection methods from the deep learning and classical statistical literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF. Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/.
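To make the selection idea concrete, below is a minimal, hypothetical sketch of greedy D-optimal pair selection on last-layer embeddings under a Bradley-Terry preference model; the function name, the ridge regularizer, and the specific greedy log-determinant criterion are illustrative assumptions, not the paper's exact algorithm. The Bradley-Terry Fisher weight p(1-p) favors pairs with moderate estimated reward differences, while the x^T A^{-1} x term favors directions of the representation space that are still poorly covered.

import numpy as np

def select_pairs_d_optimal(embeddings, rewards, candidate_pairs, budget, ridge=1e-3):
    # Greedy D-optimal selection of preference pairs under a Bradley-Terry model.
    # Hypothetical sketch: `embeddings` are last-layer features (n x d), `rewards` are
    # current scalar reward estimates (n,), `candidate_pairs` is a list of (i, j)
    # index tuples, and `budget` is the number of pairs to send for annotation.
    d = embeddings.shape[1]
    info = ridge * np.eye(d)              # regularized running Fisher information matrix
    remaining = list(candidate_pairs)
    selected = []
    for _ in range(budget):
        best_gain, best_k = -np.inf, None
        for k, (i, j) in enumerate(remaining):
            x = embeddings[i] - embeddings[j]                      # comparison feature
            p = 1.0 / (1.0 + np.exp(-(rewards[i] - rewards[j])))   # Bradley-Terry win probability
            w = p * (1.0 - p)                                      # Fisher weight, largest at p = 0.5
            # log-det gain of the rank-one update (matrix determinant lemma)
            gain = np.log1p(w * x @ np.linalg.solve(info, x))
            if gain > best_gain:
                best_gain, best_k = gain, k
        i, j = remaining.pop(best_k)
        x = embeddings[i] - embeddings[j]
        p = 1.0 / (1.0 + np.exp(-(rewards[i] - rewards[j])))
        info += p * (1.0 - p) * np.outer(x, x)                     # commit the selected pair
        selected.append((i, j))
    return selected

# Illustrative usage with random data:
# X = np.random.randn(100, 16); r = X @ np.random.randn(16)
# pairs = [(i, j) for i in range(100) for j in range(i + 1, 100)]
# chosen = select_pairs_d_optimal(X, r, pairs, budget=20)

In a realistic loop, one would presumably re-fit the last-layer reward estimates after each annotation batch before selecting the next batch; cross-prompt comparisons correspond to allowing (i, j) pairs whose responses come from different prompts.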

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-shen25c,
  title     = {Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment},
  author    = {Shen, Yunyi and Sun, Hao and Ton, Jean-Francois},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {54410--54430},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/shen25c/shen25c.pdf},
  url       = {https://proceedings.mlr.press/v267/shen25c.html},
  abstract  = {Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based reward modeling tasks. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to other selection methods from deep learning and classical statistical literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF. Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/.}
}
Endnote
%0 Conference Paper
%T Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment
%A Yunyi Shen
%A Hao Sun
%A Jean-Francois Ton
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-shen25c
%I PMLR
%P 54410--54430
%U https://proceedings.mlr.press/v267/shen25c.html
%V 267
%X Building neural reward models from human preferences is a pivotal component in reinforcement learning from human feedback (RLHF) and large language model alignment research. Given the scarcity and high cost of human annotation, how to select the most informative pairs to annotate is an essential yet challenging open problem. In this work, we highlight the insight that an ideal comparison dataset for reward modeling should balance exploration of the representation space and make informative comparisons between pairs with moderate reward differences. Technically, challenges arise in quantifying the two objectives and efficiently prioritizing the comparisons to be annotated. To address this, we propose the Fisher information-based selection strategies, adapt theories from the classical experimental design literature, and apply them to the final linear layer of the deep neural network-based reward modeling tasks. Empirically, our method demonstrates remarkable performance, high computational efficiency, and stability compared to other selection methods from deep learning and classical statistical literature across multiple open-source LLMs and datasets. Further ablation studies reveal that incorporating cross-prompt comparisons in active reward modeling significantly enhances labeling efficiency, shedding light on the potential for improved annotation strategies in RLHF. Code and embeddings to reproduce all results of this paper are available at https://github.com/YunyiShen/ARM-FI/.
APA
Shen, Y., Sun, H. & Ton, J. (2025). Active Reward Modeling: Adaptive Preference Labeling for Large Language Model Alignment. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:54410-54430. Available from https://proceedings.mlr.press/v267/shen25c.html.
