An Empirical Analysis of Off-policy Learning in Discrete MDPs

Cosmin Păduraru; Doina Precup; Joelle Pineau; Gheorghe Comănici

Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:89-102, 2013.

Abstract

Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.
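To make the abstract's point about un-normalized importance sampling concrete, here is a minimal Python sketch. It is not the authors' implementation, and every name in it (`importance_weight`, `ordinary_is`, `weighted_is`, and the trajectory layout) is a hypothetical choice made for illustration. It assumes each trajectory is a list of `(pi, b, r)` steps, where `pi` and `b` are the probabilities that the target and behaviour policies assign to the action actually taken, and `r` is the reward received.

```python
from math import prod


def importance_weight(steps):
    """Product of per-step ratios pi / b along one trajectory."""
    return prod(pi / b for pi, b, _ in steps)


def trajectory_return(steps):
    """Undiscounted sum of rewards along one trajectory."""
    return sum(r for _, _, r in steps)


def ordinary_is(trajectories):
    """Un-normalized estimate: mean of weight * return over trajectories."""
    total = sum(importance_weight(t) * trajectory_return(t) for t in trajectories)
    return total / len(trajectories)


def weighted_is(trajectories):
    """Normalized estimate: weights are divided by their sum, not the count."""
    weights = [importance_weight(t) for t in trajectories]
    total = sum(w * trajectory_return(t) for w, t in zip(weights, trajectories))
    return total / sum(weights)


# Two one-step trajectories collected under the behaviour policy (toy data).
trajs = [[(0.5, 0.25, 1.0)], [(0.5, 0.5, 0.0)]]
print(ordinary_is(trajs), weighted_is(trajs))

# A per-step ratio of 2 compounds multiplicatively with the horizon.
print(importance_weight([(0.5, 0.25, 0.0)] * 10))
```

Because the weight is a product over time steps, even a modest per-step ratio grows exponentially with the horizon, which is one way to see why the un-normalized estimator's variance can become very large beyond a few steps of look-ahead.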

Cite this Paper

BibTeX


@InProceedings{pmlr-v24-paduraru12a,
  title = 	 {An Empirical Analysis of Off-policy Learning in Discrete MDPs},
  author = 	 {Păduraru, Cosmin and Precup, Doina and Pineau, Joelle and Comănici, Gheorghe},
  booktitle = 	 {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages = 	 {89--102},
  year = 	 {2013},
  editor = 	 {Deisenroth, Marc Peter and Szepesvári, Csaba and Peters, Jan},
  volume = 	 {24},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Edinburgh, Scotland},
  month = 	 {30 Jun--01 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf},
  url = 	 {https://proceedings.mlr.press/v24/paduraru12a.html},
  abstract = 	 {Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.}
}

Endnote

%0 Conference Paper
%T An Empirical Analysis of Off-policy Learning in Discrete MDPs
%A Cosmin Păduraru
%A Doina Precup
%A Joelle Pineau
%A Gheorghe Comănici
%B Proceedings of the Tenth European Workshop on Reinforcement Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Marc Peter Deisenroth
%E Csaba Szepesvári
%E Jan Peters
%F pmlr-v24-paduraru12a
%I PMLR
%P 89--102
%U https://proceedings.mlr.press/v24/paduraru12a.html
%V 24
%X Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.

RIS


TY  - CPAPER
TI  - An Empirical Analysis of Off-policy Learning in Discrete MDPs
AU  - Cosmin Păduraru
AU  - Doina Precup
AU  - Joelle Pineau
AU  - Gheorghe Comănici
BT  - Proceedings of the Tenth European Workshop on Reinforcement Learning
DA  - 2013/01/12
ED  - Marc Peter Deisenroth
ED  - Csaba Szepesvári
ED  - Jan Peters
ID  - pmlr-v24-paduraru12a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 24
SP  - 89
EP  - 102
L1  - http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf
UR  - https://proceedings.mlr.press/v24/paduraru12a.html
AB  - Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.
ER  -

APA


Păduraru, C., Precup, D., Pineau, J. & Comănici, G. (2013). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Proceedings of the Tenth European Workshop on Reinforcement Learning, in Proceedings of Machine Learning Research 24:89-102. Available from https://proceedings.mlr.press/v24/paduraru12a.html.
