Non-Stationary Approximate Modified Policy Iteration
Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:1567-1575, 2015.
Abstract
We consider the infinite-horizon $\gamma$-discounted optimal control problem formalized by Markov Decision Processes. Running any instance of Modified Policy Iteration (a family of algorithms that can interpolate between Value and Policy Iteration) with an error $\epsilon$ at each iteration is known to lead to stationary policies that are at least $\frac{2\gamma\epsilon}{(1-\gamma)^2}$-optimal. Variations of Value and Policy Iteration that build $\ell$-periodic non-stationary policies have recently been shown to display a better $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. Our first contribution is to describe a new algorithmic scheme, Non-Stationary Modified Policy Iteration, a family of algorithms parameterized by two integers $m \ge 0$ and $\ell \ge 1$ that generalizes all the above-mentioned algorithms. While $m$ allows interpolation between Value-Iteration-style and Policy-Iteration-style updates, $\ell$ specifies the period of the non-stationary policy that is output. We show that this new family of algorithms also enjoys the improved $\frac{2\gamma\epsilon}{(1-\gamma)(1-\gamma^\ell)}$-optimality guarantee. Perhaps more importantly, we show, by exhibiting an original problem instance, that this guarantee is tight for all $m$ and $\ell$; this tightness was, to our knowledge, previously proved only for two specific cases, Value Iteration ($m=0$, $\ell=1$) and Policy Iteration ($m=\infty$, $\ell=1$).
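The interplay between $m$ and $\ell$ described in the abstract can be illustrated with a minimal tabular sketch. This is not the paper's exact algorithm or notation: the function name `ns_mpi`, the tensor layout, and the particular update $v \leftarrow (T_\pi)^{m+1} v$ are illustrative assumptions. It only shows the two roles the abstract assigns to the parameters: $m$ controls how many Bellman backups follow each greedy step ($m=0$ behaving like Value Iteration, large $m$ approaching Policy Iteration), while $\ell$ is the period of the non-stationary policy returned.

```python
import numpy as np

def ns_mpi(P, R, gamma, m=1, ell=2, n_iter=30):
    """Illustrative sketch of a Non-Stationary MPI-style scheme on a tabular MDP.

    P: transition tensor of shape (A, S, S); R: reward matrix of shape (S, A).
    Returns the last `ell` greedy policies, to be run cyclically as an
    ell-periodic non-stationary policy.  (Hypothetical variant for exposition,
    not the paper's exact update rule.)
    """
    A, S, _ = P.shape
    v = np.zeros(S)
    policies = []
    for _ in range(n_iter):
        # Q-values for the current value estimate: Q(s,a) = R(s,a) + gamma * E[v(s')]
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi = q.argmax(axis=1)  # greedy policy w.r.t. v
        policies.append(pi)
        # m+1 applications of the Bellman operator T_pi:
        # m = 0 gives a single backup (Value-Iteration-style);
        # large m approaches full policy evaluation (Policy-Iteration-style).
        for _ in range(m + 1):
            v = R[np.arange(S), pi] + gamma * P[pi, np.arange(S)].dot(v)
    return policies[-ell:]
```

For example, on a deterministic two-state MDP where action 1 switches states and action 0 stays, the scheme recovers the optimal periodic policy after a few iterations regardless of the chosen `m` and `ell`.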