Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Yujia Jin, Aaron Sidford
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5055-5064, 2021.

Abstract

We prove new upper and lower bounds on the sample complexity of finding an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of every policy is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm that uses oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs that may be of further utility.
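As a rough, back-of-the-envelope sketch of how a reduction to discounted MDPs can yield a bound of this form (the parameter choices and constants below are illustrative, not the paper's exact analysis): in a mixing MDP, the average reward $\bar{v}^\pi$ of a policy $\pi$ is close to its scaled $\gamma$-discounted value, in the sense that

$|(1-\gamma)\, v^\pi_\gamma - \bar{v}^\pi| \le O\big((1-\gamma)\, t_\mathrm{mix}\big),$

so taking $1-\gamma = \Theta(\epsilon / t_\mathrm{mix})$ makes this approximation error $O(\epsilon)$, and it then suffices to compute an $\epsilon_\gamma$-optimal policy of the discounted MDP with $\epsilon_\gamma = \Theta(\epsilon/(1-\gamma))$. If a generative-model solver for $\gamma$-discounted MDPs returns an $\epsilon_\gamma$-optimal policy using $\widetilde{O}\big((1-\gamma)^{-3}\epsilon_\gamma^{-2}\big)$ samples per state-action pair, the resulting cost is

$\widetilde{O}\big((1-\gamma)^{-3} \cdot (1-\gamma)^{2}\, \epsilon^{-2}\big) = \widetilde{O}\big((1-\gamma)^{-1}\epsilon^{-2}\big) = \widetilde{O}\big(t_\mathrm{mix}\, \epsilon^{-3}\big).$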

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-jin21b,
  title     = {Towards Tight Bounds on the Sample Complexity of Average-reward MDPs},
  author    = {Jin, Yujia and Sidford, Aaron},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5055--5064},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/jin21b/jin21b.pdf},
  url       = {https://proceedings.mlr.press/v139/jin21b.html}
}
APA
Jin, Y. & Sidford, A. (2021). Towards Tight Bounds on the Sample Complexity of Average-reward MDPs. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5055-5064. Available from https://proceedings.mlr.press/v139/jin21b.html.