Improving Reward Model Generalization from Adversarial Process Enhanced Preferences

Zhilong Zhang, Tian Xu, Xinghao Du, Xingchen Cao, Yihao Sun, Yang Yu
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:76414-76435, 2025.

Abstract

In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis showing that the selected policy pairs provably satisfy preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance.
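
The abstract describes APEC only at a high level: policies saved at widely separated iterations of an adversarial imitation learning (AIL) run are paired, and the later-iteration policy is treated as preferred. The Python sketch below is a rough illustration of that idea, not the authors' released implementation; it assumes a classic Gym-style environment API and callable policy checkpoints, and all names (collect_trajectory, apec_preferences, min_gap) are hypothetical.

import random

def collect_trajectory(policy, env, horizon=1000):
    """Roll out `policy` in `env` (classic Gym API assumed) and return the visited (state, action) pairs."""
    traj, state = [], env.reset()
    for _ in range(horizon):
        action = policy(state)
        state, _, done, _ = env.step(action)
        traj.append((state, action))
        if done:
            break
    return traj

def apec_preferences(checkpoints, env, min_gap=10, num_pairs=100):
    """Pair AIL checkpoints whose iteration indices differ by at least `min_gap`;
    the trajectory from the later checkpoint is labeled as preferred."""
    prefs = []
    for _ in range(num_pairs):
        # Draw two distinct checkpoint indices and order them by iteration.
        i, j = sorted(random.sample(range(len(checkpoints)), 2))
        if j - i < min_gap:
            continue  # skip pairs whose iteration gap is too small to imply a preference
        worse = collect_trajectory(checkpoints[i], env)
        better = collect_trajectory(checkpoints[j], env)
        prefs.append((worse, better))  # (less preferred, more preferred)
    return prefs

The resulting preference pairs could then be fed to a standard Bradley-Terry-style reward-model objective; the paper's actual pairing rule, trajectory segmentation, and theoretical conditions are given in the full text.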

Cite this Paper

BibTeX
@InProceedings{pmlr-v267-zhang25ci,
  title     = {Improving Reward Model Generalization from Adversarial Process Enhanced Preferences},
  author    = {Zhang, Zhilong and Xu, Tian and Du, Xinghao and Cao, Xingchen and Sun, Yihao and Yu, Yang},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {76414--76435},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/zhang25ci/zhang25ci.pdf},
  url       = {https://proceedings.mlr.press/v267/zhang25ci.html},
  abstract  = {In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis to validate that the selected policy pairs provably hold preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, deriving improved policy performance.}
}
Endnote
%0 Conference Paper
%T Improving Reward Model Generalization from Adversarial Process Enhanced Preferences
%A Zhilong Zhang
%A Tian Xu
%A Xinghao Du
%A Xingchen Cao
%A Yihao Sun
%A Yang Yu
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-zhang25ci
%I PMLR
%P 76414--76435
%U https://proceedings.mlr.press/v267/zhang25ci.html
%V 267
%X In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis to validate that the selected policy pairs provably hold preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, deriving improved policy performance.
APA
Zhang, Z., Xu, T., Du, X., Cao, X., Sun, Y. & Yu, Y. (2025). Improving Reward Model Generalization from Adversarial Process Enhanced Preferences. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:76414-76435. Available from https://proceedings.mlr.press/v267/zhang25ci.html.
