Adaptive Reward-Free Exploration

Emilie Kaufmann; Pierre Ménard; Omar Darwiche Domingues; Anders Jonsson; Edouard Leurent; Michal Valko

Adaptive Reward-Free Exploration

Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, Michal Valko

Proceedings of the 32nd International Conference on Algorithmic Learning Theory, PMLR 132:865-891, 2021.

Abstract

Reward-free exploration is a reinforcement learning setting recently studied by (Jin et al. 2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm by Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order (SAH^4/\epsilon^2)(log(1/\delta) + S) episodes to output, with probability 1-\delta, an \epsilon-approximation of the optimal policy for any reward function. This bound improves over existing sample complexity bounds in both the small \epsilon and the small \delta regimes. We further investigate the relative complexities of reward-free exploration and best policy identification.

Cite this Paper

BibTeX


@InProceedings{pmlr-v132-kaufmann21a,
  title = 	 {Adaptive Reward-Free Exploration},
  author =       {Kaufmann, Emilie and M{\'e}nard, Pierre and Darwiche Domingues, Omar and Jonsson, Anders and Leurent, Edouard and Valko, Michal},
  booktitle = 	 {Proceedings of the 32nd International Conference on Algorithmic Learning Theory},
  pages = 	 {865--891},
  year = 	 {2021},
  editor = 	 {Feldman, Vitaly and Ligett, Katrina and Sabato, Sivan},
  volume = 	 {132},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--19 Mar},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v132/kaufmann21a/kaufmann21a.pdf},
  url = 	 {https://proceedings.mlr.press/v132/kaufmann21a.html},
  abstract = 	 {Reward-free exploration is a reinforcement learning setting recently studied by (Jin et al. 2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm by Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order (SAH^4/\epsilon^2)(log(1/\delta) + S) episodes to output, with probability 1-\delta, an \epsilon-approximation of the optimal policy for any reward function. This bound improves over existing sample complexity bounds in both the small \epsilon and the small \delta regimes. We further investigate the relative complexities of reward-free exploration and best policy identification. }
}

Endnote

%0 Conference Paper
%T Adaptive Reward-Free Exploration
%A Emilie Kaufmann
%A Pierre Ménard
%A Omar Darwiche Domingues
%A Anders Jonsson
%A Edouard Leurent
%A Michal Valko
%B Proceedings of the 32nd International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Vitaly Feldman
%E Katrina Ligett
%E Sivan Sabato	
%F pmlr-v132-kaufmann21a
%I PMLR
%P 865--891
%U https://proceedings.mlr.press/v132/kaufmann21a.html
%V 132
%X Reward-free exploration is a reinforcement learning setting recently studied by (Jin et al. 2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more natural adaptive approach for reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm by Fiechter from 1994, originally proposed for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order (SAH^4/\epsilon^2)(log(1/\delta) + S) episodes to output, with probability 1-\delta, an \epsilon-approximation of the optimal policy for any reward function. This bound improves over existing sample complexity bounds in both the small \epsilon and the small \delta regimes. We further investigate the relative complexities of reward-free exploration and best policy identification.

APA


Kaufmann, E., Ménard, P., Darwiche Domingues, O., Jonsson, A., Leurent, E. & Valko, M.. (2021). Adaptive Reward-Free Exploration. Proceedings of the 32nd International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 132:865-891 Available from https://proceedings.mlr.press/v132/kaufmann21a.html.

Related Material

Download PDF