Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56173-56203, 2025.

Abstract

Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimization (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function, used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. In scenarios where various levels of preference drift are introduced, using popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
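
To make the weighting scheme described above concrete, the following is a minimal PyTorch sketch of an exponentially discounted DPO-style loss. It assumes each preference pair carries a collection timestamp t and that the single discount parameter gamma down-weights older pairs by gamma^(T - t); the function and variable names are illustrative and not taken from the paper.

import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
                timestamps, current_time, beta=0.1, gamma=0.95):
    # Standard DPO logits: implicit reward margin between the chosen (w)
    # and rejected (l) responses, measured against a frozen reference model.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    per_example = -F.logsigmoid(logits)
    # Exponential time weighting (assumed form): pairs collected closer to
    # current_time contribute more; gamma is the single discount parameter.
    weights = gamma ** (current_time - timestamps).to(per_example.dtype)
    return (weights * per_example).sum() / weights.sum()

# Toy usage: random summed log-probabilities for 4 preference pairs
# collected at times 0..3, evaluated at current time T = 3.
lp_w, lp_l, rp_w, rp_l = (torch.randn(4) for _ in range(4))
loss = ns_dpo_loss(lp_w, lp_l, rp_w, rp_l,
                   timestamps=torch.tensor([0, 1, 2, 3]), current_time=3)

Setting gamma = 1 recovers the standard (stationary) DPO objective in this sketch, which matches the abstract's claim that performance need not be sacrificed in stationary cases.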

Cite this Paper


BibTeX

@InProceedings{pmlr-v267-son25a,
  title     = {Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift},
  author    = {Son, Seongho and Bankes, William and Ray Chowdhury, Sayak and Paige, Brooks and Bogunovic, Ilija},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {56173--56203},
  year      = {2025},
  editor    = {Singh, Aarti and Fazel, Maryam and Hsu, Daniel and Lacoste-Julien, Simon and Berkenkamp, Felix and Maharaj, Tegan and Wagstaff, Kiri and Zhu, Jerry},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v267/main/assets/son25a/son25a.pdf},
  url       = {https://proceedings.mlr.press/v267/son25a.html}
}
APA
Son, S., Bankes, W., Ray Chowdhury, S., Paige, B., & Bogunovic, I. (2025). Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift. Proceedings of the 42nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 267:56173-56203. Available from https://proceedings.mlr.press/v267/son25a.html.
