Efficiently Solving MDPs with Stochastic Mirror Descent
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:4890-4900, 2020.
Abstract
We present a unified framework based on primal-dual stochastic mirror descent for approximately solving infinite-horizon Markov decision processes (MDPs) given a generative model. When applied to an average-reward MDP with $A_{\mathrm{tot}}$ total actions and mixing time bound $t_{\mathrm{mix}}$, our method computes an $\epsilon$-optimal policy using an expected $\tilde{O}(t_{\mathrm{mix}}^2 A_{\mathrm{tot}} \epsilon^{-2})$ samples from the state-transition matrix, removing the ergodicity dependence of prior art. When applied to a $\gamma$-discounted MDP with $A_{\mathrm{tot}}$ total actions, our method computes an $\epsilon$-optimal policy using an expected $\tilde{O}((1-\gamma)^{-4} A_{\mathrm{tot}} \epsilon^{-2})$ samples, improving over the best-known primal-dual methods and matching the state of the art up to a $(1-\gamma)^{-1}$ factor. Both methods are model-free, update state values and policies simultaneously, and run in time linear in the number of samples taken. We achieve these results through a more general stochastic mirror descent framework for solving bilinear saddle-point problems with simplex and box domains, and we demonstrate the flexibility of this framework by providing further applications to constrained MDPs.
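To make the saddle-point setup concrete, the sketch below runs primal-dual stochastic mirror descent on a toy bilinear problem $\min_{x \in [0,1]^n} \max_{y \in \Delta^m} y^\top A x$, with an entropic mirror map (multiplicative weights) on the simplex side and Euclidean projection on the box side. This is a minimal illustration under our own assumptions, not the paper's algorithm: the coordinate-sampling gradient estimators, step sizes, and function names are all hypothetical.

```python
import numpy as np

def smd_bilinear(A, T=5000, eta_x=0.01, eta_y=0.01, seed=0):
    """Toy primal-dual stochastic mirror descent for
    min_{x in [0,1]^n} max_{y in simplex} y^T A x.

    Simplex side: entropic mirror descent (multiplicative weights).
    Box side: Euclidean projected gradient. Each step uses a single
    sampled row/column to build unbiased gradient estimates.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.full(n, 0.5)          # box variable in [0,1]^n
    y = np.full(m, 1.0 / m)      # simplex variable
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # Unbiased estimate of grad_x = A^T y: sample a row i ~ y.
        i = rng.choice(m, p=y)
        g_x = A[i, :]                    # E[g_x] = A^T y
        # Unbiased estimate of grad_y = A x: sample a column j uniformly.
        j = rng.integers(n)
        g_y = n * A[:, j] * x[j]         # E[g_y] = A x
        # Box side: Euclidean descent step, then project back onto [0,1]^n.
        x = np.clip(x - eta_x * g_x, 0.0, 1.0)
        # Simplex side: entropic ascent step (multiplicative weights).
        y = y * np.exp(eta_y * g_y)
        y /= y.sum()
        x_avg += x
        y_avg += y
    # Averaged iterates approximate a saddle point.
    return x_avg / T, y_avg / T
```

Averaging the iterates is the standard way to extract an approximate saddle point from a mirror-descent trajectory; the paper's contribution lies in how the domains, mirror maps, and stochastic estimators are instantiated so that the resulting saddle point yields a near-optimal MDP policy at the stated sample complexities.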