Minimax Regret Bounds for Reinforcement Learning

Mohammad Gheshlaghi Azar; Ian Osband; Rémi Munos

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:263-272, 2017.

Abstract

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde{O}(\sqrt{HSAT} + H^2S^2A + H\sqrt{T})$, where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde{O}(HS\sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T \geq H^3S^3A$ and $SA \geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).
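
The two conditions in the abstract can be checked by simple algebra: $H^2S^2A \leq \sqrt{HSAT}$ holds exactly when $T \geq H^3S^3A$, and $H\sqrt{T} \leq \sqrt{HSAT}$ holds exactly when $SA \geq H$. The following is a minimal Python sketch of that arithmetic, not code from the paper. It drops all constants and logarithmic factors hidden by the $\tilde{O}$ notation, and the function names (`regret_terms`, `leading_term_dominates`, `ucrl2_ratio`) and the example values of $H$, $S$, $A$ are hypothetical, chosen only for illustration.

```python
import math


def regret_terms(H, S, A, T):
    """Three terms of the stated bound, ignoring constants and log factors."""
    return {
        "leading": math.sqrt(H * S * A * T),  # sqrt(HSAT)
        "lower_order": H**2 * S**2 * A,       # H^2 S^2 A
        "horizon": H * math.sqrt(T),          # H sqrt(T)
    }


def leading_term_dominates(H, S, A, T):
    """True when both conditions from the abstract hold."""
    return T >= H**3 * S**3 * A and S * A >= H


def ucrl2_ratio(H, S, A, T):
    """Ratio of the UCRL2 rate HS*sqrt(AT) to sqrt(HSAT); equals sqrt(HS)."""
    return (H * S * math.sqrt(A * T)) / math.sqrt(H * S * A * T)


if __name__ == "__main__":
    H, S, A = 10, 20, 5          # illustrative sizes, not from the paper
    T = H**3 * S**3 * A          # smallest T satisfying the first condition
    for name, value in regret_terms(H, S, A, T).items():
        print(f"{name}: {value:.1f}")
    print("leading term dominates:", leading_term_dominates(H, S, A, T))
    print("UCRL2 rate / sqrt(HSAT):", ucrl2_ratio(H, S, A, T))
```

At the threshold $T = H^3S^3A$ the leading and lower-order terms coincide, and for larger $T$ the leading term grows faster. The ratio returned by `ucrl2_ratio` is $\sqrt{HS}$ regardless of $A$ and $T$, which is the size of the improvement over the UCRL2 rate in this simplified comparison.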

Cite this Paper

BibTeX


@InProceedings{pmlr-v70-azar17a,
  title = 	 {Minimax Regret Bounds for Reinforcement Learning},
  author =       {Mohammad Gheshlaghi Azar and Ian Osband and R{\'e}mi Munos},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {263--272},
  year = 	 {2017},
  editor = 	 {Precup, Doina and Teh, Yee Whye},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--11 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v70/azar17a/azar17a.pdf},
  url = 	 {https://proceedings.mlr.press/v70/azar17a.html},
  abstract = 	 {We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde {O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde {O}(HS \sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T\geq H^3S^3A$ and $SA\geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).}
}

Endnote

%0 Conference Paper
%T Minimax Regret Bounds for Reinforcement Learning
%A Mohammad Gheshlaghi Azar
%A Ian Osband
%A Rémi Munos
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh	
%F pmlr-v70-azar17a
%I PMLR
%P 263--272
%U https://proceedings.mlr.press/v70/azar17a.html
%V 70
%X We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde {O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde {O}(HS \sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T\geq H^3S^3A$ and $SA\geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contains two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transition probabilities (to improve scaling in $S$), and we define Bernstein-based “exploration bonuses” that use the empirical variance of the estimated values at the next states (to improve scaling in $H$).

APA


Azar, M.G., Osband, I. & Munos, R. (2017). Minimax Regret Bounds for Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:263-272. Available from https://proceedings.mlr.press/v70/azar17a.html.
