Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Omar Darwiche Domingues; Pierre Ménard; Emilie Kaufmann; Michal Valko

Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited

Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, Michal Valko

Proceedings of the 32nd International Conference on Algorithmic Learning Theory, PMLR 132:578-598, 2021.

Abstract

In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the \emph{non-stationary case} in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of

$\Omega((H^3SA/\epsilon^2)\log(1/\delta))$ on the sample complexity of an

$(\varepsilon,\delta)$ -PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the

$\Omega(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.

Cite this Paper

BibTeX


@InProceedings{pmlr-v132-domingues21a,
  title = 	 {Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited},
  author =       {Domingues, Omar Darwiche and M{\'e}nard, Pierre and Kaufmann, Emilie and Valko, Michal},
  booktitle = 	 {Proceedings of the 32nd International Conference on Algorithmic Learning Theory},
  pages = 	 {578--598},
  year = 	 {2021},
  editor = 	 {Feldman, Vitaly and Ligett, Katrina and Sabato, Sivan},
  volume = 	 {132},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--19 Mar},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v132/domingues21a/domingues21a.pdf},
  url = 	 {https://proceedings.mlr.press/v132/domingues21a.html},
  abstract = 	 {In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the \emph{non-stationary case} in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of $\Omega((H^3SA/\epsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.}
}

Endnote

%0 Conference Paper
%T Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited
%A Omar Darwiche Domingues
%A Pierre Ménard
%A Emilie Kaufmann
%A Michal Valko
%B Proceedings of the 32nd International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Vitaly Feldman
%E Katrina Ligett
%E Sivan Sabato	
%F pmlr-v132-domingues21a
%I PMLR
%P 578--598
%U https://proceedings.mlr.press/v132/domingues21a.html
%V 132
%X In this paper, we propose new problem-independent lower bounds on the sample complexity and regret in episodic MDPs, with a particular focus on the \emph{non-stationary case} in which the transition kernel is allowed to change in each stage of the episode. Our main contribution is a lower bound of $\Omega((H^3SA/\epsilon^2)\log(1/\delta))$ on the sample complexity of an $(\varepsilon,\delta)$-PAC algorithm for best policy identification in a non-stationary MDP, relying on a construction of “hard MDPs” which is different from the ones previously used in the literature. Using this same class of MDPs, we also provide a rigorous proof of the $\Omega(\sqrt{H^3SAT})$ regret bound for non-stationary MDPs. Finally, we discuss connections to PAC-MDP lower bounds.

APA


Domingues, O.D., Ménard, P., Kaufmann, E. & Valko, M.. (2021). Episodic Reinforcement Learning in Finite MDPs: Minimax Lower Bounds Revisited. Proceedings of the 32nd International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 132:578-598 Available from https://proceedings.mlr.press/v132/domingues21a.html.

Related Material

Download PDF