[edit]
Feasible $Q$-Learning for Average Reward Reinforcement Learning
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1630-1638, 2024.
Abstract
Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $Q$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $Q$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $\epsilon$-close to optimal, (ii) estimate optimal average reward with $\epsilon$-accuracy, and (iii) estimate the bias function (similar to $Q$-function in discounted case) with $\epsilon$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$ samples, (ii) with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$ samples, and (iii) with $\tilde{O}(\frac{SA B}{\epsilon^9})$ samples, where $t_\mathrm{mix}$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $S, A, t_{\mathrm{mix}}, \epsilon$ for a feasible variant of $Q$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.