POLITEX: Regret Bounds for Policy Iteration using Expert Prediction
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:3692-3702, 2019.
Abstract
We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for $\tau$ time steps scales as $\epsilon(\tau) = \epsilon_0 + \tilde{O}(\sqrt{d/\tau})$, where $\epsilon_0$ is the worst-case approximation error and $d$ is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and policies. Under the error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as $\tilde{O}(d^{1/2}T^{3/4} + \epsilon_0 T)$, where $\tilde{O}(\cdot)$ hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method which scales only in the number of features, and not in the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
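As an illustration of the policy form described in the abstract, the following minimal sketch computes a Boltzmann distribution over the sum of action-value estimates for a single state. It is not the authors' implementation: the function names, the `eta` temperature parameter, the list-of-lists layout of `q_estimates`, and the convention that larger values are better (flip the sign of `eta` if the estimates are costs) are all assumptions made for the example.

```python
import math


def boltzmann_policy(q_sum, eta):
    """Return probabilities proportional to exp(eta * q_sum[a])."""
    m = max(q_sum)  # subtract the maximum for numerical stability
    weights = [math.exp(eta * (q - m)) for q in q_sum]
    total = sum(weights)
    return [w / total for w in weights]


def politex_policy(q_estimates, eta):
    """Action distribution for one state (hypothetical data layout).

    q_estimates[i][a] is the estimated value of action `a` under the
    i-th previous policy; the estimates are summed over i before the
    Boltzmann distribution is applied. Assumes values are rewards.
    """
    num_actions = len(q_estimates[0])
    q_sum = [sum(q[a] for q in q_estimates) for a in range(num_actions)]
    return boltzmann_policy(q_sum, eta)


# Three previous policies, two actions; the numbers are made up.
probs = politex_policy([[1.0, 0.0], [0.5, 0.5], [2.0, 1.0]], eta=1.0)
```

Because the estimates are summed rather than replaced at each iteration, a single noisy estimate shifts the resulting distribution less than it would if each policy depended only on the most recent estimate.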