Multiple-policy High-confidence Policy Evaluation
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:9470-9487, 2023.
Abstract
In reinforcement learning applications, we often want to accurately estimate the return of several policies of interest. We study this problem, multiple-policy high-confidence policy evaluation, where the goal is to estimate the return of every given target policy up to a desired accuracy with as few samples as possible. The natural approaches, evaluating each policy separately or estimating a model of the MDP, ignore the similarities between the target policies, and their sample complexity scales with the number of policies to evaluate or with the size of the MDP, respectively. We present an alternative approach based on reusing samples across on-policy Monte-Carlo estimators and show that it is more sample-efficient in favorable cases. Specifically, we provide guarantees in terms of a notion of overlap of the set of target policies and shed light on when such an approach is indeed beneficial compared to existing methods.
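
To make the sample-reuse idea concrete, the following Python sketch contrasts the naive per-policy Monte-Carlo estimator with an estimator that pools all on-policy rollouts and reweights them for each target policy via importance sampling. The bandit-style one-state MDP, the reward model, the three target policies, and the mixture-based weights are all illustrative assumptions chosen for exposition; this is a sketch of the general idea, not the paper's algorithm or its guarantees.

    import numpy as np

    rng = np.random.default_rng(0)

    N_ACTIONS = 3
    MEAN_REWARD = np.array([0.0, 0.5, 1.0])  # assumed toy mean reward per action

    # Three hypothetical target policies over a single state;
    # the first two overlap heavily.
    policies = np.array([
        [0.8, 0.1, 0.1],
        [0.7, 0.2, 0.1],
        [0.1, 0.1, 0.8],
    ])
    K = len(policies)
    N_PER_POLICY = 200  # on-policy rollouts collected for each target policy

    # Collect on-policy Monte-Carlo samples for every target policy.
    actions, rewards = [], []
    for pi in policies:
        a = rng.choice(N_ACTIONS, size=N_PER_POLICY, p=pi)
        actions.append(a)
        rewards.append(MEAN_REWARD[a] + rng.normal(scale=1.0, size=N_PER_POLICY))
    actions = np.concatenate(actions)
    rewards = np.concatenate(rewards)

    # Naive estimator: each policy uses only its own on-policy samples.
    naive = rewards.reshape(K, N_PER_POLICY).mean(axis=1)

    # Reuse estimator: pool every rollout. Since each policy contributed the
    # same number of samples, pooled actions follow the uniform mixture of the
    # target policies, so pi_k(a) / mixture(a) is a valid importance weight.
    mixture = policies.mean(axis=0)
    reuse = np.array([
        np.mean(policies[k, actions] / mixture[actions] * rewards)
        for k in range(K)
    ])

    print("true returns         :", (policies @ MEAN_REWARD).round(3))
    print("naive (per-policy MC):", naive.round(3))
    print("pooled sample reuse  :", reuse.round(3))

Because the first two policies place most of their mass on the same actions, each effectively benefits from the other's rollouts under the pooled estimator. This is the kind of overlap regime in which reusing samples can beat evaluating each policy separately, as the abstract describes.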