Feasible $Q$-Learning for Average Reward Reinforcement Learning

Ying Jin, Ramki Gummadi, Zhengyuan Zhou, Jose Blanchet
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:1630-1638, 2024.

Abstract

Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e., the long-run average reward) of continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress toward addressing this open problem. We design a feasible average-reward $Q$-learning framework that requires no knowledge of any problem parameter as input. The framework builds on discounted $Q$-learning but dynamically adapts the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learning a policy that is $\epsilon$-close to optimal, (ii) estimating the optimal average reward to $\epsilon$-accuracy, and (iii) estimating the bias function (the analogue of the $Q$-function in the discounted case) to $\epsilon$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$ samples, (ii) with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$ samples, and (iii) with $\tilde{O}(\frac{SA B}{\epsilon^9})$ samples, where $t_\mathrm{mix}$ is the mixing time and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $S$, $A$, $t_{\mathrm{mix}}$, and $\epsilon$ for a feasible variant of $Q$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community's best minds. Preliminary simulations verify that our framework is effective without prior knowledge of problem parameters as input.
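To make the high-level mechanism in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: synchronous discounted $Q$-learning in which the discount factor is pushed toward 1 across epochs, so that $(1-\gamma)\max_a Q(s,a)$ serves as a proxy for the average reward. The function name, the epoch and step-size schedules, and the average-reward proxy below are assumptions chosen for illustration; the paper's carefully designed adaptation schemes (and the resulting sample complexity guarantees) are not reproduced here.

```python
import numpy as np

def feasible_q_learning_sketch(P, R, epochs=20, iters_per_epoch=2000, seed=0):
    """Illustrative sketch (not the paper's exact algorithm): synchronous
    discounted Q-learning whose discount factor is driven toward 1 across
    epochs, so that (1 - gamma) * max_a Q(s, a) tracks the average reward.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    The gamma and step-size schedules below are placeholders, not the tuned
    schedules from the paper.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    rho_hat = 0.0
    for k in range(epochs):
        gamma = 1.0 - 1.0 / (k + 2)            # effective horizon grows each epoch (assumed schedule)
        for t in range(iters_per_epoch):
            alpha = 1.0 / (t + 1) ** 0.6        # polynomial step size (assumed schedule)
            # Synchronous setting: draw one next state for every (s, a) pair.
            next_states = np.array([
                [rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)
            ])
            target = R + gamma * Q[next_states].max(axis=2)
            Q += alpha * (target - Q)
        # Crude average-reward proxy from the current discounted Q-estimate.
        rho_hat = (1.0 - gamma) * Q.max(axis=1).mean()
    greedy_policy = Q.argmax(axis=1)
    return greedy_policy, rho_hat
```

The proxy $(1-\gamma)\max_a Q(s,a)$ rests on the standard fact that the discounted optimal value satisfies $(1-\gamma)V^*_\gamma(s) \to \rho^*$ as $\gamma \to 1$ for weakly communicating MDPs; the paper's contribution is adapting $\gamma$ on the fly without knowing $t_{\mathrm{mix}}$ or any other problem parameter.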

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-jin24b,
  title     = {Feasible $Q$-Learning for Average Reward Reinforcement Learning},
  author    = {Jin, Ying and Gummadi, Ramki and Zhou, Zhengyuan and Blanchet, Jose},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {1630--1638},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/jin24b/jin24b.pdf},
  url       = {https://proceedings.mlr.press/v238/jin24b.html}
}
Endnote
%0 Conference Paper
%T Feasible $Q$-Learning for Average Reward Reinforcement Learning
%A Ying Jin
%A Ramki Gummadi
%A Zhengyuan Zhou
%A Jose Blanchet
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-jin24b
%I PMLR
%P 1630--1638
%U https://proceedings.mlr.press/v238/jin24b.html
%V 238
APA
Jin, Y., Gummadi, R., Zhou, Z. & Blanchet, J. (2024). Feasible $Q$-Learning for Average Reward Reinforcement Learning. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:1630-1638. Available from https://proceedings.mlr.press/v238/jin24b.html.