Adaptive Sampling for Best Policy Identification in Markov Decision Processes

Aymen Al Marjani; Alexandre Proutiere

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:7459-7468, 2021.

Abstract

We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.
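The two functionals named in the abstract, the sub-optimality gaps and the variance of the next-state value function, can be computed directly for a small tabular MDP. The sketch below is only an illustration under stated assumptions: the two-state MDP, the function names `value_iteration` and `hardness_functionals`, and the use of value iteration to obtain the optimal value function are all hypothetical choices made here, and the code does not reproduce the paper's bound or the KLB-TS algorithm.

```python
# Illustrative sketch (not the paper's code): compute sub-optimality gaps and
# next-state value variances for a small discounted MDP with known dynamics.

def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Approximate the optimal value function V* by repeated Bellman backups."""
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = [
            max(
                r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V


def hardness_functionals(P, r, gamma):
    """Return (V, gaps, variances), with gaps and variances keyed by (s, a).

    gaps[(s, a)]      = V(s) - Q(s, a)
    variances[(s, a)] = variance of V(s') when s' is drawn from P(. | s, a)
    """
    V = value_iteration(P, r, gamma)
    gaps, variances = {}, {}
    for s in range(len(P)):
        for a in range(len(P[s])):
            mean_next = sum(p * v for p, v in zip(P[s][a], V))
            q_value = r[s][a] + gamma * mean_next
            gaps[(s, a)] = V[s] - q_value
            variances[(s, a)] = sum(
                p * (v - mean_next) ** 2 for p, v in zip(P[s][a], V)
            )
    return V, gaps, variances


# Hypothetical toy MDP: P[s][a] is the next-state distribution, r[s][a] the reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
r = [[0.0, 0.1], [0.5, 1.0]]
gamma = 0.9

V, gaps, variances = hardness_functionals(P, r, gamma)
for (s, a) in sorted(gaps):
    print(f"s={s} a={a} gap={gaps[(s, a)]:.4f} var={variances[(s, a)]:.4f}")
```

State-action pairs with a small gap or a large next-state value variance are, intuitively, the ones that need more samples before the best policy can be identified with confidence.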

Cite this Paper

BibTeX

@InProceedings{pmlr-v139-marjani21a,
  title = {Adaptive Sampling for Best Policy Identification in Markov Decision Processes},
  author = {Marjani, Aymen Al and Proutiere, Alexandre},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages = {7459--7468},
  year = {2021},
  editor = {Meila, Marina and Zhang, Tong},
  volume = {139},
  series = {Proceedings of Machine Learning Research},
  month = {18--24 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v139/marjani21a/marjani21a.pdf},
  url = {https://proceedings.mlr.press/v139/marjani21a.html},
  abstract = {We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.}
}

Endnote

%0 Conference Paper
%T Adaptive Sampling for Best Policy Identification in Markov Decision Processes
%A Aymen Al Marjani
%A Alexandre Proutiere
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-marjani21a
%I PMLR
%P 7459--7468
%U https://proceedings.mlr.press/v139/marjani21a.html
%V 139
%X We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm returning the best policy as early as possible. We first derive a problem-specific lower bound of the sample complexity satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence, is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound of the sample complexity lower bound, whose corresponding nearly-optimal sample allocation becomes explicit. The upper bound depends on specific functionals of the MDP such as the sub-optimality gaps and the variance of the next-state value function, and thus really captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS against state-of-the-art algorithms are discussed and illustrated numerically.

APA

Marjani, A.A. & Proutiere, A. (2021). Adaptive Sampling for Best Policy Identification in Markov Decision Processes. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:7459-7468. Available from https://proceedings.mlr.press/v139/marjani21a.html.
