Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes
Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:5476-5477, 2023.
Abstract
We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications involving continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms is relatively inadequate in the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results.

In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5 A^2\, \mathrm{sp}(h^*)\sqrt{T})$ after $T$ steps, where $S \times A$ is the size of the state-action space and $\mathrm{sp}(h^*)$ is the span of the optimal bias function.\footnote{We use the notation $\widetilde{O}(\cdot)$ to suppress constant and logarithmic factors.} Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret \citep{bartlett2009regal}. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs.

In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SA\, \mathrm{sp}^2(h^*)\epsilon^{-2} + S^2 A\, \mathrm{sp}(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\, \mathrm{sp}(h^*)\epsilon^{-2})$ established in \cite{wang2022average}. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{\mathrm{mix}}$, the \emph{worst-case} mixing time induced by a policy. We remark that the diameter $D$ and the mixing time $t_{\mathrm{mix}}$ are both lower bounded by $\mathrm{sp}(h^*)$, and $t_{\mathrm{mix}}$ can be arbitrarily large for certain MDPs \citep{wang2022average}.

On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction \citep{zhang2020almost} in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the \emph{value difference} between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
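
The abstract only names the algorithmic ingredients (a $\gamma$-discounted surrogate MDP, optimism via UCB-style bonuses, and reference-advantage variance reduction). As a rough illustration of how such pieces could fit together, and not a reproduction of UCB-AVG itself, the Python sketch below runs optimistic Q-learning on a discounted surrogate with a periodically refreshed reference value function in a simulator setting; the function name `optimistic_q_discounted`, the bonus constant `c_bonus`, the step-size schedule, and the refresh period are assumptions of this sketch rather than details from the paper.

```python
import numpy as np

# Purely illustrative sketch (NOT the paper's UCB-AVG): synchronous optimistic
# Q-learning on a gamma-discounted surrogate MDP with a frozen reference value
# function used for variance reduction, run against a simulator. The bonus
# constant, step-size schedule, and refresh period are assumptions of this
# sketch, not parameters specified in the paper.

def optimistic_q_discounted(P, R, gamma, num_rounds, c_bonus=1.0, refresh_every=100, seed=0):
    """P: (S, A, S) transition tensor; R: (S, A) rewards in [0, 1]."""
    S, A = R.shape
    rng = np.random.default_rng(seed)
    H = 1.0 / (1.0 - gamma)                  # effective horizon of the surrogate MDP
    Q = np.full((S, A), H)                   # optimistic initialization
    V = Q.max(axis=1)
    V_ref = V.copy()                         # reference value function (updated rarely)
    mu_ref = np.zeros((S, A))                # running mean of V_ref at sampled next states
    counts = np.zeros((S, A), dtype=int)

    for t in range(1, num_rounds + 1):
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[s, a])    # one call to the simulator
                counts[s, a] += 1
                n = counts[s, a]
                # Low-variance reference part: average V_ref over all n samples of (s, a).
                mu_ref[s, a] += (V_ref[s_next] - mu_ref[s, a]) / n
                # Advantage part: single-sample estimate of V - V_ref at the next state.
                adv = V[s_next] - V_ref[s_next]
                alpha = (H + 1.0) / (H + n)          # step size common in Q-learning analyses
                bonus = c_bonus * H * np.sqrt(np.log(t + 1.0) / n)
                target = R[s, a] + gamma * (mu_ref[s, a] + adv) + bonus
                # Keep the estimate optimistic and monotonically non-increasing.
                Q[s, a] = min(Q[s, a], (1.0 - alpha) * Q[s, a] + alpha * target)
        V = np.minimum(Q.max(axis=1), H)
        if t % refresh_every == 0:
            V_ref = V.copy()                 # freeze a new reference once values stabilize

    return Q, V


if __name__ == "__main__":
    # Tiny random MDP just to exercise the sketch.
    rng = np.random.default_rng(1)
    S, A = 5, 2
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    R = rng.random((S, A))
    Q, V = optimistic_q_discounted(P, R, gamma=0.95, num_rounds=500)
    print("greedy policy:", Q.argmax(axis=1))
```

In the sketch, $\gamma$ is passed in as a fixed argument; in the average-reward setting described above, the discount factor would instead be chosen carefully to balance the approximation error from discounting against the statistical learning error.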