POLITEX: Regret Bounds for Policy Iteration using Expert Prediction

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, Gellert Weisz
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:3692-3702, 2019.

Abstract

We present POLITEX (POLicy ITeration with EXpert advice), a variant of policy iteration where each policy is a Boltzmann distribution over the sum of action-value function estimates of the previous policies, and analyze its regret in continuing RL problems. We assume that the value function error after running a policy for $\tau$ time steps scales as $\epsilon(\tau) = \epsilon_0 + O(\sqrt{d/\tau})$, where $\epsilon_0$ is the worst-case approximation error and $d$ is the number of features in a compressed representation of the state-action space. We establish that this condition is satisfied by the LSPE algorithm under certain assumptions on the MDP and policies. Under the error assumption, we show that the regret of POLITEX in uniformly mixing MDPs scales as $O(d^{1/2}T^{3/4} + \epsilon_0T)$, where $O(\cdot)$ hides logarithmic terms and problem-dependent constants. Thus, we provide the first regret bound for a fully practical model-free method which only scales in the number of features, and not in the size of the underlying MDP. Experiments on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation.
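
The abstract's algorithmic core is the policy update: after phase $k$, the next policy is a Boltzmann (softmax) distribution over the running sum of estimated action-value functions, $\pi_{k+1}(a|s) \propto \exp(\eta \sum_{i \le k} \widehat{Q}_i(s,a))$. The sketch below renders that update in Python for illustration only; the tabular arrays, the estimate_q callback (standing in for, e.g., an LSPE-based estimate on linear features), and the temperature eta are assumptions made for this sketch, not code from the paper.

    import numpy as np

    def boltzmann_policy(q_sum, eta):
        # Softmax over actions for each state, applied to the running sum of
        # action-value estimates q_sum (shape: [n_states, n_actions]).
        logits = eta * q_sum
        logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
        probs = np.exp(logits)
        return probs / probs.sum(axis=1, keepdims=True)

    def politex_sketch(estimate_q, n_states, n_actions, n_phases, eta):
        # In each phase: run the current policy, estimate its action-value
        # function (estimate_q is a user-supplied callback, e.g. an LSPE-style
        # estimator), add the estimate to the running sum, and set the next
        # policy to a Boltzmann distribution over that sum.
        q_sum = np.zeros((n_states, n_actions))
        policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
        for _ in range(n_phases):
            q_hat = estimate_q(policy)   # shape: [n_states, n_actions]
            q_sum += q_hat
            policy = boltzmann_policy(q_sum, eta)
        return policy

    # Toy usage with a placeholder estimator (illustration only):
    # final_policy = politex_sketch(lambda pi: np.random.randn(4, 2),
    #                               n_states=4, n_actions=2, n_phases=10, eta=0.1)

Accumulating all past estimates, rather than greedily using only the latest one, is what connects the policy sequence to exponential-weights ("expert advice") prediction and underlies the regret analysis sketched in the abstract.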

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-lazic19a,
  title     = {{POLITEX}: Regret Bounds for Policy Iteration using Expert Prediction},
  author    = {Abbasi-Yadkori, Yasin and Bartlett, Peter and Bhatia, Kush and Lazic, Nevena and Szepesvari, Csaba and Weisz, Gellert},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {3692--3702},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/lazic19a/lazic19a.pdf},
  url       = {https://proceedings.mlr.press/v97/lazic19a.html}
}
Endnote
%0 Conference Paper
%T POLITEX: Regret Bounds for Policy Iteration using Expert Prediction
%A Yasin Abbasi-Yadkori
%A Peter Bartlett
%A Kush Bhatia
%A Nevena Lazic
%A Csaba Szepesvari
%A Gellert Weisz
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-lazic19a
%I PMLR
%P 3692--3702
%U https://proceedings.mlr.press/v97/lazic19a.html
%V 97
APA
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C. & Weisz, G. (2019). POLITEX: Regret Bounds for Policy Iteration using Expert Prediction. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:3692-3702. Available from https://proceedings.mlr.press/v97/lazic19a.html.
