Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Zihan Zhang, Qiaomin Xie
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:5476-5477, 2023.

Abstract

We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), a setting more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms remains relatively limited in the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results.

In the online setting, we design an algorithm, $\mathtt{UCB-AVG}$, based on an optimistic variant of variance-reduced Q-learning. We show that $\mathtt{UCB-AVG}$ achieves a regret bound of $\widetilde{O}(S^5 A^2\, \mathrm{sp}(h^*)\sqrt{T})$ after $T$ steps, where $S \times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function. (We use the notation $\widetilde{O}(\cdot)$ to suppress constant and logarithmic factors.) Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, an assumption that is necessary for low regret (Bartlett and Tewari, 2009). In contrast, prior results are either suboptimal in $T$ or require strong assumptions such as ergodicity or uniform mixing of the MDP. In the simulator setting, we adapt the idea of $\mathtt{UCB-AVG}$ to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}\!\left( SA\,\mathrm{sp}^2(h^*)\,\epsilon^{-2} + S^2 A\,\mathrm{sp}(h^*)\,\epsilon^{-1} \right)$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\,\mathrm{sp}(h^*)\,\epsilon^{-2})$ established by Wang et al. (2022). Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{\mathrm{mix}}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and the mixing time $t_{\mathrm{mix}}$ are both lower bounded by $\mathrm{sp}(h^*)$, and that $t_{\mathrm{mix}}$ can be arbitrarily large for certain MDPs (Wang et al., 2022).

On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction (Zhang et al., 2020) in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value differences between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially rely on a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
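To see why the choice of $\gamma$ matters, consider the following back-of-the-envelope calculation (a heuristic illustration only, not the paper's analysis). A standard approximation bound for weakly communicating MDPs relates the $\gamma$-discounted optimal value $V^*_\gamma$ to the optimal gain $\rho^*$ through the span of the bias:
\[
\bigl| (1-\gamma)\, V^*_\gamma(s) - \rho^* \bigr| \;\le\; (1-\gamma)\,\mathrm{sp}(h^*) \qquad \text{for all } s .
\]
Hence following a near-optimal policy of the discounted MDP for $T$ steps incurs approximation error of order $T(1-\gamma)\,\mathrm{sp}(h^*)$, which shrinks as $\gamma \to 1$, while the statistical error of learning the discounted MDP grows with the effective horizon $1/(1-\gamma)$. If, for the sake of illustration, the learning error scaled as $\sqrt{T}/(1-\gamma)$, the best trade-off would be
\[
T(1-\gamma)\,\mathrm{sp}(h^*) \;\asymp\; \frac{\sqrt{T}}{1-\gamma}
\quad\Longrightarrow\quad
1-\gamma \;\asymp\; \mathrm{sp}(h^*)^{-1/2}\, T^{-1/4},
\]
giving a total error of order $\sqrt{\mathrm{sp}(h^*)}\, T^{3/4}$, which is suboptimal in $T$. This is the sense in which a naive discounted approximation falls short; variance reduction via the reference-advantage decomposition is what sharpens the learning error's dependence on the effective horizon enough to recover the $\sqrt{T}$ rate.

The following is a minimal, self-contained sketch of optimistic Q-learning with a reference-advantage style update in the simulator setting. It only illustrates the generic variance-reduction idea and is not the paper's $\mathtt{UCB-AVG}$ algorithm; the exploration bonus, step sizes, reference schedule, and the toy MDP below are all hypothetical choices.

# Sketch: optimistic Q-learning for a gamma-discounted MDP with a
# reference-advantage style update (illustrative only, NOT UCB-AVG).
import numpy as np

def optimistic_q_reference_advantage(sample_step, S, A, gamma, T,
                                     bonus_scale=1.0, ref_burn_in=10_000, seed=0):
    rng = np.random.default_rng(seed)
    H = 1.0 / (1.0 - gamma)                      # effective horizon of the discounted MDP
    Q = np.full((S, A), H)                       # optimistic initialization (rewards in [0, 1])
    V = Q.max(axis=1)
    V_ref = V.copy()                             # reference value function (fixed after burn-in)
    N = np.zeros((S, A), dtype=int)              # visit counts
    N_ref = np.zeros((S, A), dtype=int)          # samples collected under the current reference
    r_hat = np.zeros((S, A))                     # running mean of observed rewards
    mu_ref = np.zeros((S, A))                    # running mean of V_ref(s') over all samples
    adv = np.zeros((S, A))                       # step-size-weighted estimate of E[(V - V_ref)(s')]
    s = 0
    for t in range(1, T + 1):
        a = int(np.argmax(Q[s]))                 # act greedily w.r.t. the optimistic Q
        r, s_next = sample_step(s, a, rng)
        N[s, a] += 1
        N_ref[s, a] += 1
        r_hat[s, a] += (r - r_hat[s, a]) / N[s, a]
        # Reference part: averaged over all samples, so its statistical error is small.
        mu_ref[s, a] += (V_ref[s_next] - mu_ref[s, a]) / N_ref[s, a]
        # Advantage part: V - V_ref has small magnitude, so this term has small variance
        # even though it is learned with aggressive Q-learning step sizes.
        alpha = (H + 1.0) / (H + N[s, a])
        adv[s, a] = (1 - alpha) * adv[s, a] + alpha * (V[s_next] - V_ref[s_next])
        bonus = bonus_scale * H / np.sqrt(N[s, a])   # crude UCB exploration bonus (hypothetical)
        Q[s, a] = min(Q[s, a], r_hat[s, a] + gamma * (mu_ref[s, a] + adv[s, a]) + bonus)
        V[s] = Q[s].max()
        if t == ref_burn_in:
            V_ref = V.copy()                     # fix a good-quality reference after burn-in
            mu_ref[:] = 0.0                      # restart the averages for the new reference
            N_ref[:] = 0
            adv[:] = 0.0
        s = s_next
    return Q, V

# Toy usage on a hypothetical 2-state, 2-action MDP with rewards in [0, 1].
def toy_step(s, a, rng):
    p_stay = 0.9 if a == 0 else 0.2
    s_next = s if rng.random() < p_stay else 1 - s
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return reward, s_next

Q, V = optimistic_q_reference_advantage(toy_step, S=2, A=2, gamma=0.99, T=50_000)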

Cite this Paper


BibTeX
@InProceedings{pmlr-v195-zhang23b,
  title     = {Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes},
  author    = {Zhang, Zihan and Xie, Qiaomin},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {5476--5477},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/zhang23b/zhang23b.pdf},
  url       = {https://proceedings.mlr.press/v195/zhang23b.html}
}
Endnote
%0 Conference Paper
%T Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes
%A Zihan Zhang
%A Qiaomin Xie
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco
%F pmlr-v195-zhang23b
%I PMLR
%P 5476--5477
%U https://proceedings.mlr.press/v195/zhang23b.html
%V 195
APA
Zhang, Z. & Xie, Q. (2023). Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:5476-5477. Available from https://proceedings.mlr.press/v195/zhang23b.html.
