Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Jingfeng Wu; Vladimir Braverman; Lin Yang

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Jingfeng Wu, Vladimir Braverman, Lin Yang

Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:4109-4131, 2022.

Abstract

For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is

$\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$ , where

$\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter,

$\rho>0$ , to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only

$\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an

$\epsilon$ -optimal policy for a post-revealed reward with sub-optimality gap at least

$\rho$ , where

$S$ is the number of states,

$A$ is the number of actions, and

$H$ is the length of the horizon, obtaining a nearly quadratic saving in terms of

$\epsilon$ . We show that, information-theoretically, this bound is nearly tight for

$\rho < \Theta(1/(HS))$ and

$H>1$ . We further show that

$\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for

$H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

Cite this Paper

BibTeX


@InProceedings{pmlr-v151-wu22b,
  title = 	 { Gap-Dependent Unsupervised Exploration for Reinforcement Learning },
  author =       {Wu, Jingfeng and Braverman, Vladimir and Yang, Lin},
  booktitle = 	 {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {4109--4131},
  year = 	 {2022},
  editor = 	 {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume = 	 {151},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {28--30 Mar},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v151/wu22b/wu22b.pdf},
  url = 	 {https://proceedings.mlr.press/v151/wu22b.html},
  abstract = 	 { For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly quadratic saving in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting. }
}

Endnote

%0 Conference Paper
%T  Gap-Dependent Unsupervised Exploration for Reinforcement Learning 
%A Jingfeng Wu
%A Vladimir Braverman
%A Lin Yang
%B Proceedings of The 25th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2022
%E Gustau Camps-Valls
%E Francisco J. R. Ruiz
%E Isabel Valera	
%F pmlr-v151-wu22b
%I PMLR
%P 4109--4131
%U https://proceedings.mlr.press/v151/wu22b.html
%V 151
%X  For the problem of task-agnostic reinforcement learning (RL), an agent first collects samples from an unknown environment without the supervision of reward signals, then is revealed with a reward and is asked to compute a corresponding near-optimal policy. Existing approaches mainly concern the worst-case scenarios, in which no structural information of the reward/transition-dynamics is utilized. Therefore the best sample upper bound is $\propto\widetilde{\mathcal{O}}(1/\epsilon^2)$, where $\epsilon>0$ is the target accuracy of the obtained policy, and can be overly pessimistic. To tackle this issue, we provide an efficient algorithm that utilizes a gap parameter, $\rho>0$, to reduce the amount of exploration. In particular, for an unknown finite-horizon Markov decision process, the algorithm takes only $\widetilde{\mathcal{O}} (1/\epsilon \cdot (H^3SA / \rho + H^4 S^2 A) )$ episodes of exploration, and is able to obtain an $\epsilon$-optimal policy for a post-revealed reward with sub-optimality gap at least $\rho$, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon, obtaining a nearly quadratic saving in terms of $\epsilon$. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$. We further show that $\propto\widetilde{\mathcal{O}}(1)$ sample bound is possible for $H=1$ (i.e., multi-armed bandit) or with a sampling simulator, establishing a stark separation between those settings and the RL setting.

APA


Wu, J., Braverman, V. & Yang, L.. (2022).  Gap-Dependent Unsupervised Exploration for Reinforcement Learning . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:4109-4131 Available from https://proceedings.mlr.press/v151/wu22b.html.

Gap-Dependent Unsupervised Exploration for Reinforcement Learning

Abstract

Cite this Paper

Related Material