Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

Haike Xu; Tengyu Ma; Simon Du

Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap

Haike Xu, Tengyu Ma, Simon Du

Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:4438-4472, 2021.

Abstract

This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$ regret due to over-exploration where $\Delta_{min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$s satisfying $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.

Cite this Paper

BibTeX

@InProceedings{pmlr-v134-xu21a,
  title = 	 {Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap},
  author =       {Xu, Haike and Ma, Tengyu and Du, Simon},
  booktitle = 	 {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages = 	 {4438--4472},
  year = 	 {2021},
  editor = 	 {Belkin, Mikhail and Kpotufe, Samory},
  volume = 	 {134},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {15--19 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v134/xu21a/xu21a.pdf},
  url = 	 {https://proceedings.mlr.press/v134/xu21a.html},
  abstract = 	 {This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB),  which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$ regret due to over-exploration where $\Delta_{min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$  regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$s satisfying $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.}
}

Endnote

%0 Conference Paper
%T Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap
%A Haike Xu
%A Tengyu Ma
%A Simon Du
%B Proceedings of Thirty Fourth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2021
%E Mikhail Belkin
%E Samory Kpotufe	
%F pmlr-v134-xu21a
%I PMLR
%P 4438--4472
%U https://proceedings.mlr.press/v134/xu21a.html
%V 134
%X This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB),  which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$ regret due to over-exploration where $\Delta_{min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$  regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$s satisfying $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.

APA

Xu, H., Ma, T. & Du, S.. (2021). Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:4438-4472 Available from https://proceedings.mlr.press/v134/xu21a.html.

Related Material

Download PDF