Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons

Banghua Zhu, Michael Jordan, Jiantao Jiao
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:43037-43067, 2023.

Abstract

We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both the Bradley-Terry-Luce (BTL) model (pairwise comparisons) and the Plackett-Luce (PL) model ($K$-wise comparisons), the MLE converges under a certain semi-norm for the family of linear rewards. On the other hand, when training a policy based on the learned reward model, we show that the MLE fails while a pessimistic MLE provides policies with good performance under a certain coverage assumption. We also show that under the PL model, both the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms and provide new insights for algorithm design. Our analysis can also be applied to the problems of online RLHF and inverse reinforcement learning.
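For readers unfamiliar with the comparison models named above, the following is a minimal sketch of the likelihoods that the MLE in the abstract maximizes, assuming the linear reward parameterization $r_\theta(s,a) = \theta^\top \phi(s,a)$ of the abstract's linear-reward setting (the notation here is chosen for illustration and is not taken from the paper). Under the BTL model, the probability that action $a^1$ is preferred to $a^0$ at state $s$ is

$$
\mathbb{P}_\theta\big(a^1 \succ a^0 \mid s\big) \;=\; \frac{\exp\!\big(\theta^\top \phi(s, a^1)\big)}{\exp\!\big(\theta^\top \phi(s, a^0)\big) + \exp\!\big(\theta^\top \phi(s, a^1)\big)},
$$

and under the PL model, a ranking $\sigma$ of $K$ candidate actions $a^1, \dots, a^K$ has probability

$$
\mathbb{P}_\theta\big(\sigma \mid s\big) \;=\; \prod_{k=1}^{K} \frac{\exp\!\big(\theta^\top \phi(s, a^{\sigma(k)})\big)}{\sum_{j=k}^{K} \exp\!\big(\theta^\top \phi(s, a^{\sigma(j)})\big)}.
$$

The MLE fits $\theta$ by maximizing the sum of the corresponding log-likelihoods over the observed comparisons; the "split" estimator mentioned in the abstract instead treats each pairwise comparison induced by a $K$-wise ranking as a separate BTL observation.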

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-zhu23f,
  title     = {Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons},
  author    = {Zhu, Banghua and Jordan, Michael and Jiao, Jiantao},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {43037--43067},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/zhu23f/zhu23f.pdf},
  url       = {https://proceedings.mlr.press/v202/zhu23f.html},
  abstract  = {We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.}
}
Endnote
%0 Conference Paper
%T Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
%A Banghua Zhu
%A Michael Jordan
%A Jiantao Jiao
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-zhu23f
%I PMLR
%P 43037--43067
%U https://proceedings.mlr.press/v202/zhu23f.html
%V 202
%X We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.
APA
Zhu, B., Jordan, M. & Jiao, J. (2023). Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:43037-43067. Available from https://proceedings.mlr.press/v202/zhu23f.html.