Restarted Bayesian Online Change-point Detection for Non-Stationary Markov Decision Processes
Proceedings of The 2nd Conference on Lifelong Learning Agents, PMLR 232:715-744, 2023.
Abstract
We consider the problem of learning in a non-stationary reinforcement learning (RL) environment, where the setting can be fully described by a piecewise stationary discrete-time Markov decision process (MDP). We introduce a variant of the Restarted Bayesian Online Change-Point Detection algorithm (R-BOCPD) that operates on input streams generated by the more general multinomial distribution and provides near-optimal theoretical guarantees in terms of false-alarm rate and detection delay. Based on this, we propose an improved version of the UCRL2 algorithm for MDPs with state transition kernels sampled from a multinomial distribution, which we call R-BOCPD-UCRL2. We perform a finite-time performance analysis and show that R-BOCPD-UCRL2 enjoys a favorable regret bound of $\mathcal{O}\left(D O \sqrt{A T K_T \log\left(\frac{T}{\delta}\right)} + \frac{K_T \log \frac{K_T}{\delta}}{\min\limits_\ell \mathrm{KL}\left(\boldsymbol{\theta}^{(\ell+1)},\boldsymbol{\theta}^{(\ell)}\right)} \right)$, where $D$ is the largest MDP diameter from the set of MDPs defining the piecewise stationary MDP setting, $O$ is the finite number of states (constant over all changes), $A$ is the finite number of actions (constant over all changes), $K_T$ is the number of change points, and $\boldsymbol{\theta}^{(\ell)}$ is the transition kernel during the interval $[c_\ell, c_{\ell+1})$, which we assume to be multinomially distributed over the set of states $\mathsf{O}$. Interestingly, the performance bound does not directly scale with the variation in the MDP state transition distributions and rewards, i.e., it can also handle abrupt changes. In practice, R-BOCPD-UCRL2 outperforms the state-of-the-art in a variety of scenarios in synthetic environments. We provide a detailed experimental setup along with a code repository (upon publication) that can be used to reproduce our experiments easily.
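For readers unfamiliar with the change-point component, the following minimal Python sketch illustrates Bayesian online run-length filtering on a multinomial (categorical) stream with hard restarts. It is not the paper's algorithm: the restart rule below is a crude heuristic (declare a change when the MAP run length falls far behind the time elapsed since the last restart), whereas the R-BOCPD variant analyzed in the paper uses a calibrated test with explicit false-alarm and detection-delay guarantees. All function and hyper-parameter names are illustrative assumptions.

```python
import numpy as np

def restarted_bocpd_multinomial(stream, n_cats, hazard=1e-2,
                                alpha=1.0, lag=20):
    """Run-length filtering for a categorical stream with hard restarts.

    Illustrative sketch only, not the paper's R-BOCPD: the restart rule
    and hyper-parameters are simplified placeholders.
    """
    run_probs = np.array([1.0])           # posterior over run lengths
    run_counts = np.zeros((1, n_cats))    # per-run category counts
    detections, steps = [], 0

    for t, x in enumerate(stream):
        steps += 1
        # Dirichlet-multinomial predictive of x under each run hypothesis
        pred = (alpha + run_counts[:, x]) / (run_counts.sum(axis=1) + alpha * n_cats)
        growth = run_probs * pred * (1.0 - hazard)   # run continues
        cp = (run_probs * pred * hazard).sum()       # a new run starts
        run_probs = np.concatenate(([cp], growth))
        run_probs /= run_probs.sum()

        # sufficient statistics: the fresh run starts from the prior,
        # every older run absorbs the new observation x
        run_counts = np.vstack((np.zeros(n_cats), run_counts))
        run_counts[1:, x] += 1

        # restart when the MAP run length lags far behind the elapsed time
        map_r = int(np.argmax(run_probs))
        if steps - map_r > lag:
            detections.append(t)          # detection time (change near t - map_r)
            run_probs = np.array([1.0])
            run_counts = np.zeros((1, n_cats))
            steps = 0

    return detections


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # piecewise-stationary categorical stream with one change at t = 500
    stream = np.concatenate([
        rng.choice(3, size=500, p=[0.7, 0.2, 0.1]),
        rng.choice(3, size=500, p=[0.1, 0.2, 0.7]),
    ])
    print(restarted_bocpd_multinomial(stream, n_cats=3))
```

In the paper's setting, a detector of this kind would monitor the observed transition stream and trigger a restart of the UCRL2-style learner's statistics whenever a change in the transition kernel is flagged; the sketch above only covers the detection step on a single categorical stream.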