SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li, Daniel Khashabi
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:35081-35100, 2025.

Abstract

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
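To make the abstract's central idea concrete, below is a minimal sketch of what "simply mixing these two data sources" could look like in practice: sample pairwise preference examples from an on-policy pool and an off-policy pool at a chosen ratio, shuffle them together, and train with standard DPO on the combined set. Everything here is an illustrative assumption rather than a detail from the paper: the function name mix_preference_data, the {"prompt", "chosen", "rejected"} field layout, and the 50/50 default ratio.

import random
from typing import Dict, List

# Hypothetical sketch of naively mixing on- and off-policy preference data
# before a single DPO training run. Field names and the default ratio are
# assumptions for illustration, not details taken from the paper.

def mix_preference_data(
    on_policy: List[Dict[str, str]],
    off_policy: List[Dict[str, str]],
    on_policy_ratio: float = 0.5,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample preference pairs from each source and shuffle them together.

    Each element is a pairwise preference example of the form
    {"prompt": ..., "chosen": ..., "rejected": ...}.
    """
    rng = random.Random(seed)
    # Largest mixed set both pools can support at the requested ratio.
    total = min(
        len(on_policy) / max(on_policy_ratio, 1e-8),
        len(off_policy) / max(1.0 - on_policy_ratio, 1e-8),
    )
    n_on = int(total * on_policy_ratio)
    n_off = int(total * (1.0 - on_policy_ratio))

    mixed = rng.sample(on_policy, n_on) + rng.sample(off_policy, n_off)
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    # Toy stand-ins for real on-policy (sampled from the current model) and
    # off-policy (pre-collected) preference pairs.
    on_policy = [{"prompt": f"q{i}", "chosen": "a", "rejected": "b"} for i in range(100)]
    off_policy = [{"prompt": f"p{i}", "chosen": "a", "rejected": "b"} for i in range(100)]
    mixed = mix_preference_data(on_policy, off_policy, on_policy_ratio=0.5)
    print(len(mixed))  # 200 pairs: 100 on-policy + 100 off-policy

The only knob in this sketch is the mixing ratio; the paper's point, per the abstract, is that even such a naive combination captures the complementary strengths of the two data sources and outperforms more complex hybrid schemes.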

Cite this Paper


BibTeX
@InProceedings{pmlr-v267-li25au,
  title     = {{SIMPLEMIX}: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning},
  author    = {Li, Tianjian and Khashabi, Daniel},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {35081--35100},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/li25au/li25au.pdf},
  url       = {https://proceedings.mlr.press/v267/li25au.html},
  abstract  = {Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.}
}
Endnote
%0 Conference Paper
%T SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
%A Tianjian Li
%A Daniel Khashabi
%B Proceedings of the 42nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2025
%E Aarti Singh
%E Maryam Fazel
%E Daniel Hsu
%E Simon Lacoste-Julien
%E Felix Berkenkamp
%E Tegan Maharaj
%E Kiri Wagstaff
%E Jerry Zhu
%F pmlr-v267-li25au
%I PMLR
%P 35081--35100
%U https://proceedings.mlr.press/v267/li25au.html
%V 267
%X Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
APA
Li, T. & Khashabi, D. (2025). SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:35081-35100. Available from https://proceedings.mlr.press/v267/li25au.html.
