Model-Free Monte Carlo-like Policy Evaluation

Raphael Fonteneau; Susan Murphy; Louis Wehenkel; Damien Ernst

Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:217-224, 2010.

Abstract

We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
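As a rough illustration of the estimator described above, the following Python sketch builds each broken trajectory by repeatedly picking the not-yet-used one-step transition whose state-action pair is closest to the current state and the action the policy prescribes there, then averages the cumulated rewards over all broken trajectories. The function name `mfmc_estimate`, the tuple layout `(x, u, r, y)`, the policy signature `policy(t, x)`, the rule that each transition is used at most once, and the distance (sum of Euclidean distances on states and on actions) are assumptions made for this sketch; the abstract does not specify them.

```python
import math


def mfmc_estimate(transitions, policy, x0, horizon, num_trajectories):
    """Sketch of a Monte Carlo-like estimate built from one-step transitions.

    transitions: list of (x, u, r, y) with x, u, y tuples of floats and r a float
                 (state, action, reward, next state); layout is an assumption.
    policy:      callable (t, x) -> action tuple (hypothetical signature).
    """
    if num_trajectories * horizon > len(transitions):
        raise ValueError("need at least num_trajectories * horizon transitions")

    unused = set(range(len(transitions)))  # each transition is used at most once
    returns = []
    for _ in range(num_trajectories):
        state, total = tuple(x0), 0.0
        for t in range(horizon):
            action = tuple(policy(t, state))

            # Assumed distance: ||x - x_l|| + ||u - u_l||; ties broken by index.
            def distance(i):
                x_l, u_l = transitions[i][0], transitions[i][1]
                return (math.dist(state, x_l) + math.dist(action, u_l), i)

            best = min(unused, key=distance)
            unused.remove(best)
            _, _, reward, next_state = transitions[best]
            total += reward
            # The next step starts from the *end* of the selected transition,
            # which is what makes the trajectory "broken".
            state = tuple(next_state)
        returns.append(total)
    return sum(returns) / len(returns)
```

With a sample that is dense around the states the policy actually visits, the selected transitions lie close to a true trajectory; the abstract's bias and variance bounds quantify this in terms of the Lipschitz constants, the number of broken trajectories, and the sparsity of the sample.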

Cite this Paper

BibTeX


@InProceedings{pmlr-v9-fonteneau10a,
  title = 	 {Model-Free Monte Carlo-like Policy Evaluation},
  author = 	 {Fonteneau, Raphael and Murphy, Susan and Wehenkel, Louis and Ernst, Damien},
  booktitle = 	 {Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
  pages = 	 {217--224},
  year = 	 {2010},
  editor = 	 {Teh, Yee Whye and Titterington, Mike},
  volume = 	 {9},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Chia Laguna Resort, Sardinia, Italy},
  month = 	 {13--15 May},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v9/fonteneau10a/fonteneau10a.pdf},
  url = 	 {https://proceedings.mlr.press/v9/fonteneau10a.html},
  abstract = 	 {We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.}
}

Endnote

%0 Conference Paper
%T Model-Free Monte Carlo-like Policy Evaluation
%A Raphael Fonteneau
%A Susan Murphy
%A Louis Wehenkel
%A Damien Ernst
%B Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2010
%E Yee Whye Teh
%E Mike Titterington
%F pmlr-v9-fonteneau10a
%I PMLR
%P 217--224
%U https://proceedings.mlr.press/v9/fonteneau10a.html
%V 9
%X We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

RIS


TY  - CPAPER
TI  - Model-Free Monte Carlo-like Policy Evaluation
AU  - Raphael Fonteneau
AU  - Susan Murphy
AU  - Louis Wehenkel
AU  - Damien Ernst
BT  - Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
DA  - 2010/03/31
ED  - Yee Whye Teh
ED  - Mike Titterington
ID  - pmlr-v9-fonteneau10a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 9
SP  - 217
EP  - 224
L1  - http://proceedings.mlr.press/v9/fonteneau10a/fonteneau10a.pdf
UR  - https://proceedings.mlr.press/v9/fonteneau10a.html
AB  - We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
ER  -

APA


Fonteneau, R., Murphy, S., Wehenkel, L. & Ernst, D. (2010). Model-Free Monte Carlo-like Policy Evaluation. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 9:217-224. Available from https://proceedings.mlr.press/v9/fonteneau10a.html.
