Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:56173-56203, 2025.
Abstract
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimization (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function, which applies exponential weighting to focus learning proportionally on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. In scenarios where varying levels of preference drift are introduced, using popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
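To illustrate the exponential-weighting idea described in the abstract, the sketch below shows a DPO-style loss in which each preference datapoint is discounted by how old it is. This is a minimal, hedged sketch assuming a PyTorch setup; the function name ns_dpo_loss, the discount parameter gamma, the per-example timestamps, and the weight normalization are illustrative assumptions and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                timestamps, current_time, beta=0.1, gamma=0.95):
    """Sketch: DPO-style loss with exponential time discounting.

    Each example is weighted by gamma ** (current_time - t), so older,
    less time-relevant preference datapoints contribute less to learning.
    (Illustrative only; not the paper's code.)
    """
    # Standard DPO logits: beta times the difference of policy and
    # reference log-probability ratios between chosen and rejected responses.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-example DPO loss: negative log-sigmoid of the logits.
    per_example_loss = -F.logsigmoid(logits)

    # Exponential discount: recent datapoints get weights close to 1,
    # older ones are downweighted geometrically.
    weights = gamma ** (current_time - timestamps).float()

    # Weighted average over the batch (normalization choice is an assumption).
    return (weights * per_example_loss).sum() / weights.sum()
```

Setting gamma = 1 recovers an unweighted DPO-style objective, which is consistent with the claim that performance need not be sacrificed in stationary cases.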