Model-Free Monte Carlo-like Policy Evaluation

Raphael Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:217-224, 2010.

Abstract

We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
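To make the construction concrete, the following is a minimal Python sketch of such a broken-trajectory estimator. It is not the authors' implementation: the transition format (x, u, r, y), the time-indexed policy signature policy(t, x), and the Euclidean distance over (state, action) pairs are illustrative assumptions.

import numpy as np

def mfmc_estimate(transitions, policy, x0, horizon, num_trajectories):
    """Average the cumulated reward of `num_trajectories` broken trajectories,
    each pieced together from one-step transitions (x, u, r, y) in `transitions`.
    Assumes states and actions are numpy-compatible and policy(t, x) returns an action."""
    assert num_trajectories * horizon <= len(transitions), "not enough transitions"

    def dist(x, u, xp, up):
        # Illustrative metric: Euclidean norm on the concatenated (state, action) gap.
        return np.linalg.norm(np.append(np.asarray(x) - np.asarray(xp),
                                        np.asarray(u) - np.asarray(up)))

    available = list(range(len(transitions)))  # each transition is used at most once
    returns = []
    for _ in range(num_trajectories):
        x, total = np.asarray(x0, dtype=float), 0.0
        for t in range(horizon):
            u = policy(t, x)
            # Select the unused transition whose (state, action) pair is closest to
            # the current state and the action the evaluated policy would take.
            i = min(available, key=lambda j: dist(x, u, transitions[j][0], transitions[j][1]))
            available.remove(i)
            _, _, r, y = transitions[i]
            total += r                       # accumulate the observed reward
            x = np.asarray(y, dtype=float)   # jump to the stored successor state
        returns.append(total)
    return float(np.mean(returns))

Consuming each transition in at most one broken trajectory mirrors the role that sample sparsity plays in the bias and variance bounds: the closer the selected (state, action) pairs are to those the evaluated policy would actually visit, the tighter the estimate.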

Cite this Paper


BibTeX
@InProceedings{pmlr-v9-fonteneau10a,
  title     = {Model-Free Monte Carlo-like Policy Evaluation},
  author    = {Fonteneau, Raphael and Murphy, Susan and Wehenkel, Louis and Ernst, Damien},
  booktitle = {Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics},
  pages     = {217--224},
  year      = {2010},
  editor    = {Teh, Yee Whye and Titterington, Mike},
  volume    = {9},
  series    = {Proceedings of Machine Learning Research},
  address   = {Chia Laguna Resort, Sardinia, Italy},
  month     = {13--15 May},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v9/fonteneau10a/fonteneau10a.pdf},
  url       = {https://proceedings.mlr.press/v9/fonteneau10a.html},
  abstract  = {We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.}
}
Endnote
%0 Conference Paper
%T Model-Free Monte Carlo-like Policy Evaluation
%A Raphael Fonteneau
%A Susan Murphy
%A Louis Wehenkel
%A Damien Ernst
%B Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2010
%E Yee Whye Teh
%E Mike Titterington
%F pmlr-v9-fonteneau10a
%I PMLR
%P 217--224
%U https://proceedings.mlr.press/v9/fonteneau10a.html
%V 9
%X We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
RIS
TY - CPAPER
TI - Model-Free Monte Carlo-like Policy Evaluation
AU - Raphael Fonteneau
AU - Susan Murphy
AU - Louis Wehenkel
AU - Damien Ernst
BT - Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics
DA - 2010/03/31
ED - Yee Whye Teh
ED - Mike Titterington
ID - pmlr-v9-fonteneau10a
PB - PMLR
DP - Proceedings of Machine Learning Research
VL - 9
SP - 217
EP - 224
L1 - http://proceedings.mlr.press/v9/fonteneau10a/fonteneau10a.pdf
UR - https://proceedings.mlr.press/v9/fonteneau10a.html
AB - We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.
ER -
APA
Fonteneau, R., Murphy, S., Wehenkel, L. & Ernst, D. (2010). Model-Free Monte Carlo-like Policy Evaluation. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 9:217-224. Available from https://proceedings.mlr.press/v9/fonteneau10a.html.