Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:5476-5477, 2023.
Abstract
We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications involving continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms in the average-reward setting is relatively inadequate. In this paper, we consider both the online setting and the setting with access to a simulator, and develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results.

In the online setting, we design an algorithm, $\mathtt{UCB-AVG}$, based on an optimistic variant of variance-reduced Q-learning. We show that $\mathtt{UCB-AVG}$ achieves a regret bound of $\widetilde{O}(S^5A^2\,\mathrm{sp}(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function.\footnote{We use the notation $\widetilde{O}(\cdot)$ to suppress constant and logarithmic factors.} Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret \citep{bartlett2009regal}. In contrast, prior results are either suboptimal in $T$ or require strong assumptions such as ergodicity or uniform mixing of the MDP.

In the simulator setting, we adapt the idea of $\mathtt{UCB-AVG}$ to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}\left( SA\,\mathrm{sp}^2(h^*)\epsilon^{-2} + S^2A\,\mathrm{sp}(h^*)\epsilon^{-1} \right).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\,\mathrm{sp}(h^*)\epsilon^{-2})$ established in \cite{wang2022average}. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{\mathrm{mix}}$, the \emph{worst-case} mixing time induced by a policy. We remark that the diameter $D$ and the mixing time $t_{\mathrm{mix}}$ are both lower bounded by $\mathrm{sp}(h^*)$, and $t_{\mathrm{mix}}$ can be arbitrarily large for certain MDPs \citep{wang2022average}.

On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction \citep{zhang2020almost} in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the \emph{value difference} between state pairs, which provides a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
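To give a sense of the first idea, the quality of the discounted approximation for a weakly communicating MDP is governed by bounds of roughly the following form, which are standard in the discounted-to-average reduction literature; the exact statements and constants used in the paper may differ. Here $V^*_\gamma$ denotes the optimal value function of the $\gamma$-discounted MDP and $\rho^*$ the optimal gain:
\[
\mathrm{sp}(V^*_\gamma) \;\le\; 2\,\mathrm{sp}(h^*),
\qquad
\bigl|(1-\gamma)V^*_\gamma(s) - \rho^*\bigr| \;\le\; (1-\gamma)\,\mathrm{sp}(h^*) \quad \text{for all } s.
\]
Choosing the effective horizon $1/(1-\gamma)$ thus trades the approximation error due to discounting against the statistical error of learning the discounted MDP.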
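As a brief illustration of the second idea (the notation here is generic and not tied to the paper's exact construction), the reference-advantage decomposition rewrites the quantity to be estimated, $P_{s,a}V$, relative to a slowly changing reference value function $V^{\mathrm{ref}}$:
\[
P_{s,a}V \;=\; P_{s,a}V^{\mathrm{ref}} \;+\; P_{s,a}\bigl(V - V^{\mathrm{ref}}\bigr).
\]
The reference term can be estimated accurately by averaging over all past visits to $(s,a)$, while the advantage term $V - V^{\mathrm{ref}}$ has small range once $V^{\mathrm{ref}}$ is close to $V$, so its empirical estimate has low variance; the value-difference estimates between state pairs mentioned above serve to certify this small range.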