Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal

Alekh Agarwal; Sham Kakade; Lin F. Yang

Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:67-83, 2020.

Abstract

This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model based planning: is this naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regards to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using \emph{any} high accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.
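The plug-in approach described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sample_next_state`, `build_empirical_model`, `value_iteration`) are hypothetical, the MDP is a small tabular one given as nested lists, rewards are assumed known, and value iteration stands in for the black-box planning oracle (the abstract says any high-accuracy planner suffices).

```python
import random

def sample_next_state(P, s, a, rng):
    """Generative-model oracle (assumed interface): draw one next state from P[s][a]."""
    return rng.choices(range(len(P[s][a])), weights=P[s][a])[0]

def build_empirical_model(P, n_samples, rng):
    """Maximum likelihood estimate of the transitions: empirical
    next-state frequencies from n_samples oracle calls per (s, a)."""
    S, A = len(P), len(P[0])
    P_hat = []
    for s in range(S):
        row = []
        for a in range(A):
            counts = [0] * S
            for _ in range(n_samples):
                counts[sample_next_state(P, s, a, rng)] += 1
            row.append([c / n_samples for c in counts])
        P_hat.append(row)
    return P_hat

def value_iteration(P, R, gamma, tol=1e-8):
    """One possible planning oracle: value iteration on a tabular MDP.
    Returns a greedy policy and its approximate value function."""
    S, A = len(P), len(P[0])
    V = [0.0] * S
    while True:
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(A)] for s in range(S)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            policy = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
            return policy, V_new
        V = V_new
```

To use it, call `build_empirical_model` on the true transition table (which the learner sees only through the oracle), then pass the resulting `P_hat` to `value_iteration` in place of the true model. The returned policy is optimal for the empirical MDP; the paper's question is how many samples per state-action pair make it $\epsilon$-optimal in the true one.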

Cite this Paper

BibTeX

@InProceedings{pmlr-v125-agarwal20b,
  title = 	 {Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal},
  author =       {Agarwal, Alekh and Kakade, Sham and Yang, Lin F.},
  booktitle = 	 {Proceedings of Thirty Third Conference on Learning Theory},
  pages = 	 {67--83},
  year = 	 {2020},
  editor = 	 {Abernethy, Jacob and Agarwal, Shivani},
  volume = 	 {125},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--12 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v125/agarwal20b/agarwal20b.pdf},
  url = 	 {https://proceedings.mlr.press/v125/agarwal20b.html},
  abstract = 	 { This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model based planning: is this naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regards to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results,  we show that using \emph{any} high accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.}
}

Endnote

%0 Conference Paper
%T Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal
%A Alekh Agarwal
%A Sham Kakade
%A Lin F. Yang
%B Proceedings of Thirty Third Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Jacob Abernethy
%E Shivani Agarwal
%F pmlr-v125-agarwal20b
%I PMLR
%P 67--83
%U https://proceedings.mlr.press/v125/agarwal20b.html
%V 125
%X  This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model based planning: is this naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regards to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results,  we show that using \emph{any} high accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.

APA

Agarwal, A., Kakade, S. & Yang, L.F. (2020). Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal. Proceedings of Thirty Third Conference on Learning Theory, in Proceedings of Machine Learning Research 125:67-83. Available from https://proceedings.mlr.press/v125/agarwal20b.html.
