Online Learning in Markov Decision Processes with Changing Cost Sequences

Travis Dick; Andras Gyorgy; Csaba Szepesvari

Online Learning in Markov Decision Processes with Changing Cost Sequences

Travis Dick, Andras Gyorgy, Csaba Szepesvari

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(1):512-520, 2014.

Abstract

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information. We propose to view this problem as an instance of online linear optimization. We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2). In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-dick14,
  title = 	 {Online Learning in Markov Decision Processes with Changing Cost Sequences},
  author = 	 {Dick, Travis and Gyorgy, Andras and Szepesvari, Csaba},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {512--520},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Bejing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/dick14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/dick14.html},
  abstract = 	 {In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information.  We propose to view this problem as an instance of online linear optimization.  We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.  We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2).  In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.}
}

Endnote

%0 Conference Paper
%T Online Learning in Markov Decision Processes with Changing Cost Sequences
%A Travis Dick
%A Andras Gyorgy
%A Csaba Szepesvari
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-dick14
%I PMLR
%P 512--520
%U https://proceedings.mlr.press/v32/dick14.html
%V 32
%N 1
%X In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information.  We propose to view this problem as an instance of online linear optimization.  We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.  We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2).  In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.

RIS


TY  - CPAPER
TI  - Online Learning in Markov Decision Processes with Changing Cost Sequences
AU  - Travis Dick
AU  - Andras Gyorgy
AU  - Csaba Szepesvari
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/01/27
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-dick14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 1
SP  - 512
EP  - 520
L1  - http://proceedings.mlr.press/v32/dick14.pdf
UR  - https://proceedings.mlr.press/v32/dick14.html
AB  - In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit-information.  We propose to view this problem as an instance of online linear optimization.  We propose two methods for this problem: MD^2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks.  We provide a rigorous complexity analysis of these techniques, while providing near-optimal regret-bounds (in particular, we take into account the computational costs of performing approximate projections in MD^2).  In the case of full-information feedback, our results complement existing ones. In the case of bandit-information feedback we consider the online stochastic shortest path problem, a special case of the above MDP problems, and manage to improve the existing results by removing the previous restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
ER  -

APA


Dick, T., Gyorgy, A. & Szepesvari, C.. (2014). Online Learning in Markov Decision Processes with Changing Cost Sequences. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(1):512-520 Available from https://proceedings.mlr.press/v32/dick14.html.

Online Learning in Markov Decision Processes with Changing Cost Sequences

Abstract

Cite this Paper

Related Material