Design Considerations in Offline Preference-based RL

Alekh Agarwal, Christoph Dann, Teodor Vanislavov Marinov
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:499-512, 2025.

Abstract

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.
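To make the design space concrete, here is a brief sketch based on the standard published forms of these objectives rather than on this paper's own notation. Let $\pi_\theta$ be the policy being trained, $\pi_{\mathrm{ref}}$ the policy used to normalize log-likelihoods, and, for a prompt $x$ with preferred response $y_w$ and dispreferred response $y_l$, define the normalized log-likelihood margin

$$h_\theta(x, y_w, y_l) \;=\; \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.$$

The methods named above then differ mainly in the scalar loss applied to this margin over the offline preference dataset:

$$\ell_{\mathrm{DPO}} = -\log \sigma\!\bigl(\beta\, h_\theta\bigr), \qquad \ell_{\mathrm{IPO}} = \Bigl(h_\theta - \tfrac{1}{2\tau}\Bigr)^{2}, \qquad \ell_{\mathrm{SLiC}} = \max\!\bigl(0,\; \delta - h_\theta\bigr),$$

where $\sigma$ is the logistic function and $\beta, \tau, \delta > 0$ are regularization or margin hyperparameters. (The original SLiC calibration loss applies the hinge to unnormalized log-likelihoods together with a separate regularizer toward a reference policy; which policy, if any, is used for normalization is precisely one of the design choices the paper studies.)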

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-agarwal25a,
  title     = {Design Considerations in Offline Preference-based {RL}},
  author    = {Agarwal, Alekh and Dann, Christoph and Marinov, Teodor Vanislavov},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {499--512},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/agarwal25a/agarwal25a.pdf},
  url       = {https://proceedings.mlr.press/v267/agarwal25a.html}
}
Endnote
%0 Conference Paper
%T Design Considerations in Offline Preference-based RL
%A Alekh Agarwal
%A Christoph Dann
%A Teodor Vanislavov Marinov
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-agarwal25a
%I PMLR
%P 499--512
%U https://proceedings.mlr.press/v267/agarwal25a.html
%V 267
APA
Agarwal, A., Dann, C. & Marinov, T.V. (2025). Design Considerations in Offline Preference-based RL. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:499-512. Available from https://proceedings.mlr.press/v267/agarwal25a.html.
