Minimax Regret Bounds for Reinforcement Learning

Mohammad Gheshlaghi Azar, Ian Osband, Rémi Munos
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:263-272, 2017.

Abstract

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new result is that, when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).

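The two conditions in the abstract correspond exactly to the two lower-order terms of the bound: $H^2S^2A \leq \sqrt{HSAT}$ precisely when $T \geq H^3S^3A$, and $H\sqrt{T} \leq \sqrt{HSAT}$ precisely when $SA \geq H$, so in that regime the leading $\sqrt{HSAT}$ term dominates.

To illustrate the second insight, the sketch below shows optimistic backward value iteration with a Bernstein-style exploration bonus built from the empirical variance of the estimated next-state values. This is a minimal illustration rather than the authors' exact algorithm: the function name, the constants in the bonus, and the confidence term `L` are assumptions.

```python
import numpy as np

def optimistic_value_iteration(counts, rewards, H, delta=0.05):
    """Illustrative sketch: value iteration with a Bernstein-style exploration bonus.

    counts[s, a, s2] -- observed transition counts so far
    rewards[s, a]    -- mean rewards, assumed known and bounded in [0, 1]
    Returns an optimistic Q-function of shape (H, S, A).
    The constants and the confidence term below are illustrative assumptions.
    """
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)            # visits to each (s, a), floored at 1
    p_hat = counts / n[:, :, None]                   # empirical transition probabilities
    L = np.log(5 * S * A * n.max() * H / delta)      # log confidence term (assumed form)

    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):                   # backward induction over the horizon
        ev = p_hat @ V[h + 1]                        # empirical expectation of V_{h+1}
        var = np.maximum(p_hat @ (V[h + 1] ** 2) - ev ** 2, 0.0)  # empirical variance of V_{h+1}
        # Bernstein-style bonus: variance-dependent term plus a lower-order 1/n term
        bonus = np.sqrt(2.0 * L * var / n) + 7.0 * L * H / (3.0 * n)
        Q[h] = np.minimum(rewards + ev + bonus, H)   # optimism, clipped at the maximum value H
        V[h] = Q[h].max(axis=1)
    return Q[:H]
```

Acting greedily with respect to the returned optimistic Q-function in each episode then directs exploration toward poorly visited state-action pairs.
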
Cite this Paper


BibTeX
@InProceedings{pmlr-v70-azar17a,
  title     = {Minimax Regret Bounds for Reinforcement Learning},
  author    = {Mohammad Gheshlaghi Azar and Ian Osband and R{\'e}mi Munos},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages     = {263--272},
  year      = {2017},
  editor    = {Precup, Doina and Teh, Yee Whye},
  volume    = {70},
  series    = {Proceedings of Machine Learning Research},
  month     = {06--11 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v70/azar17a/azar17a.pdf},
  url       = {https://proceedings.mlr.press/v70/azar17a.html},
  abstract  = {We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new result is that, when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).}
}
Endnote
%0 Conference Paper
%T Minimax Regret Bounds for Reinforcement Learning
%A Mohammad Gheshlaghi Azar
%A Ian Osband
%A Rémi Munos
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-azar17a
%I PMLR
%P 263--272
%U https://proceedings.mlr.press/v70/azar17a.html
%V 70
%X We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previously known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new result is that, when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).
APA
Azar, M.G., Osband, I. & Munos, R. (2017). Minimax Regret Bounds for Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:263-272. Available from https://proceedings.mlr.press/v70/azar17a.html.
