CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries

Ni Mu, Hao Hu, Xiao Hu, Yiqin Yang, Bo Xu, Qing-Shan Jia
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:45050-45068, 2025.

Abstract

Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL’s real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teachers and real human feedback settings. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.
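To make the idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the two ingredients it describes: a contrastive-style objective that pushes apart the embeddings of segment pairs whose preference label was unambiguous while keeping ambiguous pairs close, and a query selector that favors well-separated pairs. The encoder architecture, the margin form of the loss, the use of 0.5 as the "indistinguishable" label, and all function names are illustrative assumptions; the actual CLARIFY objective and selection rule are given in the paper.

```python
# Hypothetical sketch of preference-aware trajectory embeddings and
# unambiguous query selection; architecture and loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryEncoder(nn.Module):
    """Encodes a (T, obs_dim + act_dim) segment into a unit-norm embedding."""

    def __init__(self, input_dim: int, embed_dim: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_dim, 128, batch_first=True)
        self.head = nn.Linear(128, embed_dim)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(segments)                 # h: (num_layers, B, 128)
        return F.normalize(self.head(h[-1]), dim=-1)


def contrastive_preference_loss(z0, z1, labels, margin: float = 1.0):
    """Push apart pairs with a clear preference (label 0 or 1);
    pull together pairs the teacher marked as indistinguishable (label 0.5)."""
    dist = (z0 - z1).norm(dim=-1)
    clear = (labels != 0.5).float()
    push = clear * F.relu(margin - dist) ** 2     # separate clearly ranked pairs
    pull = (1.0 - clear) * dist ** 2              # keep ambiguous pairs close
    return (push + pull).mean()


def select_queries(embeddings: torch.Tensor, num_queries: int):
    """Pick segment pairs with the largest embedding distance, on the
    assumption that distant pairs are easier for humans to label."""
    dists = torch.cdist(embeddings, embeddings)   # (N, N) pairwise distances
    dists = torch.triu(dists, diagonal=1)         # keep each pair once
    flat_idx = torch.topk(dists.flatten(), num_queries).indices
    return [divmod(i.item(), embeddings.size(0)) for i in flat_idx]
```

In this reading, the selector replaces uniform or disagreement-based query sampling: pairs that the learned embedding already separates are the ones a human teacher is least likely to find ambiguous, which is the label-efficiency gain the abstract claims.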

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-mu25a,
  title     = {{CLARIFY}: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries},
  author    = {Mu, Ni and Hu, Hao and Hu, Xiao and Yang, Yiqin and Xu, Bo and Jia, Qing-Shan},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {45050--45068},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/mu25a/mu25a.pdf},
  url       = {https://proceedings.mlr.press/v267/mu25a.html},
  abstract  = {Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL’s real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teachers and real human feedback settings. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.}
}
Endnote
%0 Conference Paper
%T CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries
%A Ni Mu
%A Hao Hu
%A Xiao Hu
%A Yiqin Yang
%A Bo Xu
%A Qing-Shan Jia
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-mu25a
%I PMLR
%P 45050--45068
%U https://proceedings.mlr.press/v267/mu25a.html
%V 267
%X Preference-based reinforcement learning (PbRL) bypasses explicit reward engineering by inferring reward functions from human preference comparisons, enabling better alignment with human intentions. However, humans often struggle to label a clear preference between similar segments, reducing label efficiency and limiting PbRL’s real-world applicability. To address this, we propose an offline PbRL method: Contrastive LeArning for ResolvIng Ambiguous Feedback (CLARIFY), which learns a trajectory embedding space that incorporates preference information, ensuring clearly distinguished segments are spaced apart, thus facilitating the selection of more unambiguous queries. Extensive experiments demonstrate that CLARIFY outperforms baselines in both non-ideal teachers and real human feedback settings. Our approach not only selects more distinguished queries but also learns meaningful trajectory embeddings.
APA
Mu, N., Hu, H., Hu, X., Yang, Y., Xu, B. & Jia, Q.-S. (2025). CLARIFY: Contrastive Preference Reinforcement Learning for Untangling Ambiguous Queries. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:45050-45068. Available from https://proceedings.mlr.press/v267/mu25a.html.
