Settling the sample complexity of online reinforcement learning

Zihan Zhang; Yuxin Chen; Jason D Lee; Simon S Du

Settling the sample complexity of online reinforcement learning

Zihan Zhang, Yuxin Chen, Jason D Lee, Simon S Du

Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:5213-5219, 2024.

Abstract

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al., achieves a regret on the order of

$\min\big\{ \sqrt{SAH^3K}, \,HK \big\},$ where

$S$ is the number of states,

$A$ is the number of actions,

$H$ is the horizon length, and

$K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size K, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield

$\varepsilon$ -accuracy) of

$\frac{SAH^3}{\varepsilon^2}$ up to log factor, which is minimax-optimal for the full epsilon-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm to decouple complicated statistical dependency — a long-standing challenge facing the analysis of online RL in the sample-hungry regime.

Cite this Paper

BibTeX


@InProceedings{pmlr-v247-zhang24a,
  title = 	 {Settling the sample complexity of online reinforcement learning},
  author =       {Zhang, Zihan and Chen, Yuxin and Lee, Jason D and Du, Simon S},
  booktitle = 	 {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages = 	 {5213--5219},
  year = 	 {2024},
  editor = 	 {Agrawal, Shipra and Roth, Aaron},
  volume = 	 {247},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {30 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v247/zhang24a/zhang24a.pdf},
  url = 	 {https://proceedings.mlr.press/v247/zhang24a.html},
  abstract = 	 {A central issue lying at the heart of online reinforcement learning (RL) is data efficiency.  While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory.   We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al., achieves a regret on the order of $$\min\big\{  \sqrt{SAH^3K}, \,HK \big\},$$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size K, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factor, which is minimax-optimal for the full epsilon-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm to decouple complicated statistical dependency — a long-standing challenge facing the analysis of online RL in the sample-hungry regime.   }
}

Endnote

%0 Conference Paper
%T Settling the sample complexity of online reinforcement learning
%A Zihan Zhang
%A Yuxin Chen
%A Jason D Lee
%A Simon S Du
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth	
%F pmlr-v247-zhang24a
%I PMLR
%P 5213--5219
%U https://proceedings.mlr.press/v247/zhang24a.html
%V 247
%X A central issue lying at the heart of online reinforcement learning (RL) is data efficiency.  While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory.   We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al., achieves a regret on the order of $$\min\big\{  \sqrt{SAH^3K}, \,HK \big\},$$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size K, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factor, which is minimax-optimal for the full epsilon-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm to decouple complicated statistical dependency — a long-standing challenge facing the analysis of online RL in the sample-hungry regime.

APA


Zhang, Z., Chen, Y., Lee, J.D. & Du, S.S.. (2024). Settling the sample complexity of online reinforcement learning. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:5213-5219 Available from https://proceedings.mlr.press/v247/zhang24a.html.

Related Material

Download PDF