Approximate Policy Iteration Schemes: A Comparison

Bruno Scherrer
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1314-1322, 2014.

Abstract

We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on several approximate variations of the Policy Iteration algorithm: Approximate Policy Iteration (API), Conservative Policy Iteration (CPI), a natural adaptation of the Policy Search by Dynamic Programming algorithm to the infinite-horizon case (PSDP_∞), and the recently proposed Non-Stationary Policy Iteration (NSPI(m)). For all algorithms, we describe performance bounds and make a comparison, paying particular attention to the concentrability constants involved, the number of iterations, and the memory required. Our analysis highlights the following points: 1) The performance guarantee of CPI can be arbitrarily better than that of API/API(α), but this comes at the cost of a relative increase of the number of iterations that is exponential in 1/ε. 2) PSDP_∞ enjoys the best of both worlds: its performance guarantee is similar to that of CPI, but it is obtained within a number of iterations similar to that of API. 3) Unlike API, which requires constant memory, the memory needed by CPI and PSDP_∞ is proportional to their number of iterations, which may be problematic when the discount factor γ is close to 1 or the approximation error ε is close to 0; we show that the NSPI(m) algorithm allows an overall trade-off between memory and performance. Simulations with these schemes confirm our analysis.
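
To fix ideas, the conservative update at the heart of CPI can be sketched as follows; this is a generic sketch rather than the paper's exact notation, with 𝒢(π_k) standing for an approximate greedy policy with respect to the current policy π_k and α_{k+1} ∈ (0, 1] for a mixing step size:

π_{k+1} = (1 − α_{k+1}) π_k + α_{k+1} 𝒢(π_k).

Unrolling this recursion shows that π_{k+1} is a mixture of all the (approximately) greedy policies computed so far, which is one way to see why the memory needed by CPI grows with the number of iterations; PSDP_∞ similarly keeps the whole sequence of policies it has generated, while NSPI(m), roughly speaking, retains only the m most recent ones, hence the memory/performance trade-off mentioned above.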

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-scherrer14,
  title     = {Approximate Policy Iteration Schemes: A Comparison},
  author    = {Scherrer, Bruno},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages     = {1314--1322},
  year      = {2014},
  editor    = {Xing, Eric P. and Jebara, Tony},
  volume    = {32},
  number    = {2},
  series    = {Proceedings of Machine Learning Research},
  address   = {Beijing, China},
  month     = {22--24 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v32/scherrer14.pdf},
  url       = {https://proceedings.mlr.press/v32/scherrer14.html}
}
APA
Scherrer, B. (2014). Approximate Policy Iteration Schemes: A Comparison. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1314-1322. Available from https://proceedings.mlr.press/v32/scherrer14.html.
