Non-Stationary Approximate Modified Policy Iteration

Boris Lesner; Bruno Scherrer

Non-Stationary Approximate Modified Policy Iteration

Boris Lesner, Bruno Scherrer

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:1567-1575, 2015.

Abstract

We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error εat each iteration is known to lead to stationary policies that are at least \frac2γε(1-γ)^2-optimal. Variations of Value and Policy Iteration, that build \ell-periodic non-stationary policies, have recently been shown to display a better \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Our first contribution is to describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m \ge 0 and \ell \ge 1 that generalizes all the above mentionned algorithms. While m allows to interpolate between Value-Iteration-style and Policy-Iteration-style updates, \ell specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and \ell; this tightness was to our knowledge only proved two specific cases, Value Iteration (m=0,\ell=1) and Policy Iteration (m=∞,\ell=1).

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-lesner15,
  title = 	 {Non-Stationary Approximate Modified Policy Iteration},
  author = 	 {Lesner, Boris and Scherrer, Bruno},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {1567--1575},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/lesner15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/lesner15.html},
  abstract = 	 {We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error εat each iteration is known to lead to stationary policies that are at least \frac2γε(1-γ)^2-optimal. Variations of Value and Policy Iteration, that build \ell-periodic non-stationary policies, have recently been shown to display a better \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Our first contribution is to describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m \ge 0 and \ell \ge 1 that generalizes all the above mentionned algorithms. While m allows to interpolate between Value-Iteration-style and Policy-Iteration-style updates, \ell specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and \ell; this tightness was to our knowledge only proved two specific cases, Value Iteration (m=0,\ell=1) and Policy Iteration (m=∞,\ell=1).}
}

Endnote

%0 Conference Paper
%T Non-Stationary Approximate Modified Policy Iteration
%A Boris Lesner
%A Bruno Scherrer
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei	
%F pmlr-v37-lesner15
%I PMLR
%P 1567--1575
%U https://proceedings.mlr.press/v37/lesner15.html
%V 37
%X We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error εat each iteration is known to lead to stationary policies that are at least \frac2γε(1-γ)^2-optimal. Variations of Value and Policy Iteration, that build \ell-periodic non-stationary policies, have recently been shown to display a better \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Our first contribution is to describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m \ge 0 and \ell \ge 1 that generalizes all the above mentionned algorithms. While m allows to interpolate between Value-Iteration-style and Policy-Iteration-style updates, \ell specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and \ell; this tightness was to our knowledge only proved two specific cases, Value Iteration (m=0,\ell=1) and Policy Iteration (m=∞,\ell=1).

RIS


TY  - CPAPER
TI  - Non-Stationary Approximate Modified Policy Iteration
AU  - Boris Lesner
AU  - Bruno Scherrer
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei	
ID  - pmlr-v37-lesner15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 1567
EP  - 1575
L1  - http://proceedings.mlr.press/v37/lesner15.pdf
UR  - https://proceedings.mlr.press/v37/lesner15.html
AB  - We consider the infinite-horizon γ-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration—a family of algorithms that can interpolate between Value and Policy Iteration—with an error εat each iteration is known to lead to stationary policies that are at least \frac2γε(1-γ)^2-optimal. Variations of Value and Policy Iteration, that build \ell-periodic non-stationary policies, have recently been shown to display a better \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Our first contribution is to describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers m \ge 0 and \ell \ge 1 that generalizes all the above mentionned algorithms. While m allows to interpolate between Value-Iteration-style and Policy-Iteration-style updates, \ell specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved \frac2γε(1-γ)(1-γ^\ell)-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all m and \ell; this tightness was to our knowledge only proved two specific cases, Value Iteration (m=0,\ell=1) and Policy Iteration (m=∞,\ell=1).
ER  -

APA


Lesner, B. & Scherrer, B.. (2015). Non-Stationary Approximate Modified Policy Iteration. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:1567-1575 Available from https://proceedings.mlr.press/v37/lesner15.html.

Non-Stationary Approximate Modified Policy Iteration

Abstract

Cite this Paper

Related Material