Approximate Policy Iteration Schemes: A Comparison

Bruno Scherrer

Approximate Policy Iteration Schemes: A Comparison

Bruno Scherrer

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1314-1322, 2014.

Abstract

We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP_∞), and the recently proposed Non-Stationary Policy iteration (NSPI(m)). For all algorithms, we describe performance bounds, and make a comparison by paying a particular attention to the concentrability constants involved, the number of iterations and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API/API(α), but this comes at the cost of a relative—exponential in \frac1ε—increase of the number of iterations. 2) PSDP_∞enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API that requires a constant memory, the memory needed by CPI and PSDP_∞is proportional to their number of iterations, which may be problematic when the discount factor γis close to 1 or the approximation error εis close to 0; we show that the NSPI(m) algorithm allows to make an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-scherrer14,
  title = 	 {Approximate Policy Iteration Schemes: A Comparison},
  author = 	 {Scherrer, Bruno},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {1314--1322},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Bejing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/scherrer14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/scherrer14.html},
  abstract = 	 {We consider the infinite-horizon discounted optimal control problem  formalized by Markov Decision Processes. We focus on several  approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy Iteration  (CPI), a natural adaptation of the Policy Search by  Dynamic Programming algorithm to the  infinite-horizon case (PSDP_∞), and the recently proposed  Non-Stationary Policy iteration (NSPI(m)). For all  algorithms, we describe performance bounds, and  make a comparison by paying a particular attention to the  concentrability constants involved, the number of iterations and the  memory required. Our analysis highlights the following points: 1) The  performance guarantee of CPI can be arbitrarily better than that of  API/API(α), but this comes at the cost of a  relative—exponential in \frac1ε—increase of the  number of iterations. 2) PSDP_∞enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API that  requires a constant memory, the memory needed by CPI and PSDP_∞is  proportional to their number of iterations, which may be problematic  when the discount factor γis close to 1 or the  approximation error εis close to 0; we show that  the NSPI(m) algorithm allows to make an overall trade-off between  memory and performance. Simulations with these schemes confirm our  analysis.}
}

Endnote

%0 Conference Paper
%T Approximate Policy Iteration Schemes: A Comparison
%A Bruno Scherrer
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara	
%F pmlr-v32-scherrer14
%I PMLR
%P 1314--1322
%U https://proceedings.mlr.press/v32/scherrer14.html
%V 32
%N 2
%X We consider the infinite-horizon discounted optimal control problem  formalized by Markov Decision Processes. We focus on several  approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy Iteration  (CPI), a natural adaptation of the Policy Search by  Dynamic Programming algorithm to the  infinite-horizon case (PSDP_∞), and the recently proposed  Non-Stationary Policy iteration (NSPI(m)). For all  algorithms, we describe performance bounds, and  make a comparison by paying a particular attention to the  concentrability constants involved, the number of iterations and the  memory required. Our analysis highlights the following points: 1) The  performance guarantee of CPI can be arbitrarily better than that of  API/API(α), but this comes at the cost of a  relative—exponential in \frac1ε—increase of the  number of iterations. 2) PSDP_∞enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API that  requires a constant memory, the memory needed by CPI and PSDP_∞is  proportional to their number of iterations, which may be problematic  when the discount factor γis close to 1 or the  approximation error εis close to 0; we show that  the NSPI(m) algorithm allows to make an overall trade-off between  memory and performance. Simulations with these schemes confirm our  analysis.

RIS


TY  - CPAPER
TI  - Approximate Policy Iteration Schemes: A Comparison
AU  - Bruno Scherrer
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara	
ID  - pmlr-v32-scherrer14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 1314
EP  - 1322
L1  - http://proceedings.mlr.press/v32/scherrer14.pdf
UR  - https://proceedings.mlr.press/v32/scherrer14.html
AB  - We consider the infinite-horizon discounted optimal control problem  formalized by Markov Decision Processes. We focus on several  approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration, Conservative Policy Iteration  (CPI), a natural adaptation of the Policy Search by  Dynamic Programming algorithm to the  infinite-horizon case (PSDP_∞), and the recently proposed  Non-Stationary Policy iteration (NSPI(m)). For all  algorithms, we describe performance bounds, and  make a comparison by paying a particular attention to the  concentrability constants involved, the number of iterations and the  memory required. Our analysis highlights the following points: 1) The  performance guarantee of CPI can be arbitrarily better than that of  API/API(α), but this comes at the cost of a  relative—exponential in \frac1ε—increase of the  number of iterations. 2) PSDP_∞enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but within a number of iterations similar to that of API. 3) Contrary to API that  requires a constant memory, the memory needed by CPI and PSDP_∞is  proportional to their number of iterations, which may be problematic  when the discount factor γis close to 1 or the  approximation error εis close to 0; we show that  the NSPI(m) algorithm allows to make an overall trade-off between  memory and performance. Simulations with these schemes confirm our  analysis.
ER  -

APA


Scherrer, B.. (2014). Approximate Policy Iteration Schemes: A Comparison. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1314-1322 Available from https://proceedings.mlr.press/v32/scherrer14.html.

Approximate Policy Iteration Schemes: A Comparison

Abstract

Cite this Paper

Related Material