Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons

Banghua Zhu; Michael Jordan; Jiantao Jiao

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:43037-43067, 2023.

Abstract

We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both the Bradley-Terry-Luce (BTL) model (pairwise comparison) and the Plackett-Luce (PL) model ($K$-wise comparison), the MLE converges under a certain semi-norm for the family of linear rewards. On the other hand, when training a policy based on the learned reward model, we show that the MLE fails while a pessimistic MLE provides policies with good performance under a certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms and provide new insights for algorithm design. Our analysis also applies to the problems of online RLHF and inverse reinforcement learning.
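
To make the two comparison models in the abstract concrete, here is a minimal sketch of their log-likelihoods under a linear reward, written as the inner product of a parameter vector and a feature vector. It is an illustration only, not the paper's code: the function names, the list-based feature vectors, and the choice to split a $K$-wise ranking into all implied pairs are assumptions made for this example.

```python
import math
from itertools import combinations


def reward(theta, phi):
    """Linear reward: inner product of parameters and features (assumed form)."""
    return sum(t * p for t, p in zip(theta, phi))


def btl_log_likelihood(theta, phi_win, phi_lose):
    """BTL log-probability that the item with features phi_win is preferred."""
    diff = reward(theta, phi_win) - reward(theta, phi_lose)
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        return -math.log1p(math.exp(-diff))
    return diff - math.log1p(math.exp(diff))


def pl_log_likelihood(theta, ranked_phis):
    """PL log-probability of a full ranking, most preferred item first."""
    scores = [reward(theta, phi) for phi in ranked_phis]
    total = 0.0
    for k in range(len(scores)):
        tail = scores[k:]  # items not yet placed at position k
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        total += scores[k] - log_norm
    return total


def split_pairwise_log_likelihood(theta, ranked_phis):
    """Pairwise surrogate: sum of BTL terms over every pair implied by the ranking."""
    return sum(
        btl_log_likelihood(theta, ranked_phis[i], ranked_phis[j])
        for i, j in combinations(range(len(ranked_phis)), 2)
    )
```

With $K = 2$ the PL log-likelihood reduces to the BTL one, so the two functions agree on a pair. Maximizing `pl_log_likelihood` over the parameters corresponds to what the abstract calls the true MLE, while maximizing `split_pairwise_log_likelihood` corresponds to the pairwise-splitting variant, under the assumptions stated above.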

Cite this Paper

BibTeX

@InProceedings{pmlr-v202-zhu23f,
  title = 	 {Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons},
  author =       {Zhu, Banghua and Jordan, Michael and Jiao, Jiantao},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {43037--43067},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/zhu23f/zhu23f.pdf},
  url = 	 {https://proceedings.mlr.press/v202/zhu23f.html},
  abstract = 	 {We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.}
}

Endnote

%0 Conference Paper
%T Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
%A Banghua Zhu
%A Michael Jordan
%A Jiantao Jiao
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-zhu23f
%I PMLR
%P 43037--43067
%U https://proceedings.mlr.press/v202/zhu23f.html
%V 202
%X We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.

APA

Zhu, B., Jordan, M. & Jiao, J. (2023). Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:43037-43067. Available from https://proceedings.mlr.press/v202/zhu23f.html.
