Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Heewoong Choi; Sangwon Jung; Hongjoon Ahn; Taesup Moon

Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8651-8671, 2024.

Abstract

In Reinforcement Learning (RL), designing precise reward functions remains to be a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods have limitations as they often overlook the second-order preference that indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be efficiently built by using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. Our code is available at https://github.com/chwoong/LiRE

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-choi24b,
  title = 	 {Listwise Reward Estimation for Offline Preference-based Reinforcement Learning},
  author =       {Choi, Heewoong and Jung, Sangwon and Ahn, Hongjoon and Moon, Taesup},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {8651--8671},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/choi24b/choi24b.pdf},
  url = 	 {https://proceedings.mlr.press/v235/choi24b.html},
  abstract = 	 {In Reinforcement Learning (RL), designing precise reward functions remains to be a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods have limitations as they often overlook the second-order preference that indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be efficiently built by using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. Our code is available at https://github.com/chwoong/LiRE}
}

Endnote

%0 Conference Paper
%T Listwise Reward Estimation for Offline Preference-based Reinforcement Learning
%A Heewoong Choi
%A Sangwon Jung
%A Hongjoon Ahn
%A Taesup Moon
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp	
%F pmlr-v235-choi24b
%I PMLR
%P 8651--8671
%U https://proceedings.mlr.press/v235/choi24b.html
%V 235
%X In Reinforcement Learning (RL), designing precise reward functions remains to be a challenge, particularly when aligning with human intent. Preference-based RL (PbRL) was introduced to address this problem by learning reward models from human feedback. However, existing PbRL methods have limitations as they often overlook the second-order preference that indicates the relative strength of preference. In this paper, we propose Listwise Reward Estimation (LiRE), a novel approach for offline PbRL that leverages second-order preference information by constructing a Ranked List of Trajectories (RLT), which can be efficiently built by using the same ternary feedback type as traditional methods. To validate the effectiveness of LiRE, we propose a new offline PbRL dataset that objectively reflects the effect of the estimated rewards. Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. Our code is available at https://github.com/chwoong/LiRE

APA


Choi, H., Jung, S., Ahn, H. & Moon, T.. (2024). Listwise Reward Estimation for Offline Preference-based Reinforcement Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:8651-8671 Available from https://proceedings.mlr.press/v235/choi24b.html.

Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Abstract

Cite this Paper

Related Material