[edit]

# Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

*Proceedings of Thirty Sixth Conference on Learning Theory*, PMLR 195:5476-5477, 2023.

#### Abstract

We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuingoperations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results.In the online setting, we design an algorithm, $\mathtt{UCB-AVG}$, based on an optimistic variant of variance-reduced Q-learning. We show that $\mathtt{UCB-AVG}$ achieves a regret bound $\otilde(S^5A^2\spn(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of state-action space, and $\mathrm{sp}(h^*)$ the span of the optimal bias function.\footnote{We use the notation $\widetilde{O}(\cdot)$ to suppress constant and logarithmic factors.} Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret \citep{bartlett2009regal}. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of $\mathtt{UCB-AVG}$ to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\otilde\left( {SA\spn^2(h^*)\epsilon^{-2}} + S^2A\spn(h^*)\epsilon^{-1} \right).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\spn(^*)\epsilon^{-2})$ established in \cite{wang2022average}. Existing work mainly focuses on ergodic MDPs and the results typically depend on $\mix,$ the \emph{worst-case} mixing time induced by a policy. We remark that the diameter $D$ and mixing time $\mix$ are both lower bounded by $\spn(h^*)$, and $\mix$ can be arbitrarily large for certain MDPs \citep{wang2022average}.On the technical side, our approach integrates two key ideas: learning an $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction \citep{zhang2020almost} in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of \emph{value-difference} between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.