Beyond No Regret: Instance-Dependent PAC Reinforcement Learning

Andrew J Wagenmaker; Max Simchowitz; Kevin Jamieson

Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:358-418, 2022.

Abstract

The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible: there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity, yielding a complexity which scales with the suboptimality gaps and the “reachability” of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

Cite this Paper

BibTeX


@InProceedings{pmlr-v178-wagenmaker22a,
  title = 	 {Beyond No Regret: Instance-Dependent PAC Reinforcement Learning},
  author =       {Wagenmaker, Andrew J and Simchowitz, Max and Jamieson, Kevin},
  booktitle = 	 {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages = 	 {358--418},
  year = 	 {2022},
  editor = 	 {Loh, Po-Ling and Raginsky, Maxim},
  volume = 	 {178},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {02--05 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v178/wagenmaker22a/wagenmaker22a.pdf},
  url = 	 {https://proceedings.mlr.press/v178/wagenmaker22a.html},
  abstract = 	 {The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible—there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity—yielding a complexity which scales with the suboptimality gaps and the “reachability” of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.}
}

Endnote

%0 Conference Paper
%T Beyond No Regret: Instance-Dependent PAC Reinforcement Learning
%A Andrew J Wagenmaker
%A Max Simchowitz
%A Kevin Jamieson
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky	
%F pmlr-v178-wagenmaker22a
%I PMLR
%P 358--418
%U https://proceedings.mlr.press/v178/wagenmaker22a.html
%V 178
%X The theory of reinforcement learning has focused on two fundamental problems: achieving low regret, and identifying $\epsilon$-optimal policies. While a simple reduction allows one to apply a low-regret algorithm to obtain an $\epsilon$-optimal policy and achieve the worst-case optimal rate, it is unknown whether low-regret algorithms can obtain the instance-optimal rate for policy identification. We show this is not possible—there exists a fundamental tradeoff between achieving low regret and identifying an $\epsilon$-optimal policy at the instance-optimal rate. Motivated by our negative finding, we propose a new measure of instance-dependent sample complexity for PAC tabular reinforcement learning which explicitly accounts for the attainable state visitation distributions in the underlying MDP. We then propose and analyze a novel, planning-based algorithm which attains this sample complexity—yielding a complexity which scales with the suboptimality gaps and the “reachability” of a state. We show our algorithm is nearly minimax optimal, and on several examples that our instance-dependent sample complexity offers significant improvements over worst-case bounds.

APA


Wagenmaker, A.J., Simchowitz, M. & Jamieson, K. (2022). Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:358-418. Available from https://proceedings.mlr.press/v178/wagenmaker22a.html.
