Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Zihan Zhang; Xiangyang Ji; Simon Du

Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:3858-3904, 2022.

Abstract

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent of the planning horizon}. Specifically, we consider a tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and an agent that plays for $K$ episodes. We design an algorithm that achieves an $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret, in contrast to existing bounds, which either have an additional $\mathrm{polylog}(H)$ dependency \citep{zhang2020reinforcement} or have an exponential dependency on $S$ \citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.

Cite this Paper

BibTeX


@InProceedings{pmlr-v178-zhang22a,
  title = 	 {Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies},
  author =       {Zhang, Zihan and Ji, Xiangyang and Du, Simon},
  booktitle = 	 {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages = 	 {3858--3904},
  year = 	 {2022},
  editor = 	 {Loh, Po-Ling and Raginsky, Maxim},
  volume = 	 {178},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {02--05 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v178/zhang22a/zhang22a.pdf},
  url = 	 {https://proceedings.mlr.press/v178/zhang22a.html},
  abstract = 	 {This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent on the planning horizon}.  Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an  $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either has an additional $\mathrm{polylog}(H)$ dependency \citep{zhang2020reinforcement} or has an exponential dependency on $S$ \citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.}
}

Endnote

%0 Conference Paper
%T Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies
%A Zihan Zhang
%A Xiangyang Ji
%A Simon Du
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-zhang22a
%I PMLR
%P 3858--3904
%U https://proceedings.mlr.press/v178/zhang22a.html
%V 178
%X This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound \emph{independent on the planning horizon}.  Specifically, we consider tabular MDP with $S$ states, $A$ actions, a planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We design an algorithm that achieves an  $O\left(\mathrm{poly}(S,A,\log K)\sqrt{K}\right)$ regret in contrast to existing bounds which either has an additional $\mathrm{polylog}(H)$ dependency \citep{zhang2020reinforcement} or has an exponential dependency on $S$ \citep{li2021settling}. Our result relies on a sequence of new structural lemmas establishing the approximation power, stability, and concentration property of stationary policies, which can have applications in other problems related to Markov chains.

APA


Zhang, Z., Ji, X. & Du, S. (2022). Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:3858-3904. Available from https://proceedings.mlr.press/v178/zhang22a.html.
