Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Yujia Jin; Aaron Sidford

Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Yujia Jin, Aaron Sidford

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5055-5064, 2021.

Abstract

We prove new upper and lower bounds for sample complexity of finding an

$\epsilon$ -optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most

$t_\mathrm{mix}$ , we provide an algorithm that solves the problem using

$\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on

$t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-jin21b,
  title = 	 {Towards Tight Bounds on the Sample Complexity of Average-reward MDPs},
  author =       {Jin, Yujia and Sidford, Aaron},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {5055--5064},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/jin21b/jin21b.pdf},
  url = 	 {https://proceedings.mlr.press/v139/jin21b.html},
  abstract = 	 {We prove new upper and lower bounds for sample complexity of finding an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.}
}

Endnote

%0 Conference Paper
%T Towards Tight Bounds on the Sample Complexity of Average-reward MDPs
%A Yujia Jin
%A Aaron Sidford
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang	
%F pmlr-v139-jin21b
%I PMLR
%P 5055--5064
%U https://proceedings.mlr.press/v139/jin21b.html
%V 139
%X We prove new upper and lower bounds for sample complexity of finding an $\epsilon$-optimal policy of an infinite-horizon average-reward Markov decision process (MDP) given access to a generative model. When the mixing time of the probability transition matrix of all policies is at most $t_\mathrm{mix}$, we provide an algorithm that solves the problem using $\widetilde{O}(t_\mathrm{mix} \epsilon^{-3})$ (oblivious) samples per state-action pair. Further, we provide a lower bound showing that a linear dependence on $t_\mathrm{mix}$ is necessary in the worst case for any algorithm which computes oblivious samples. We obtain our results by establishing connections between infinite-horizon average-reward MDPs and discounted MDPs of possible further utility.

APA


Jin, Y. & Sidford, A.. (2021). Towards Tight Bounds on the Sample Complexity of Average-reward MDPs. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5055-5064 Available from https://proceedings.mlr.press/v139/jin21b.html.

Towards Tight Bounds on the Sample Complexity of Average-reward MDPs

Abstract

Cite this Paper

Related Material