An Empirical Analysis of Off-policy Learning in Discrete MDPs

Cosmin Păduraru, Doina Precup, Joelle Pineau, Gheorghe Comănici
Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:89-102, 2013.

Abstract

Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying which of them perform best. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte Carlo-style methods.
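To illustrate the variance issue the abstract describes, the following is a minimal sketch (not the authors' code) of off-policy evaluation by importance sampling in a toy one-state, finite-horizon problem. All policies, reward values, and parameter names here are illustrative assumptions; the sketch only shows how the un-normalized (ordinary) estimator's per-trajectory importance ratios multiply over the horizon, compared with the weighted (normalized) estimator.

import numpy as np

# Illustrative sketch: ordinary vs. weighted importance sampling for
# off-policy evaluation. One state, two actions, fixed horizon H.
rng = np.random.default_rng(0)

H = 10                       # horizon (number of decision steps)
n_episodes = 1000            # trajectories collected under the behaviour policy

b = np.array([0.5, 0.5])     # behaviour policy over the two actions
pi = np.array([0.9, 0.1])    # target (evaluation) policy
r_mean = np.array([1.0, 0.0])  # per-step expected reward for each action

def run_episode():
    """Sample one episode under b; return the cumulative importance
    ratio prod_t pi(a_t)/b(a_t) and the undiscounted return."""
    ratio, ret = 1.0, 0.0
    for _ in range(H):
        a = rng.choice(2, p=b)
        ratio *= pi[a] / b[a]
        ret += rng.normal(r_mean[a], 0.1)
    return ratio, ret

ratios, returns = map(np.array, zip(*(run_episode() for _ in range(n_episodes))))

# Ordinary (un-normalized) importance sampling: unbiased, but the products
# of per-step ratios can blow up its variance as H grows.
ois = np.mean(ratios * returns)

# Weighted (normalized) importance sampling: biased, typically lower variance.
wis = np.sum(ratios * returns) / np.sum(ratios)

print(f"ordinary IS estimate: {ois:.3f}")
print(f"weighted IS estimate: {wis:.3f}")
print(f"true value under pi:  {H * pi @ r_mean:.3f}")

Increasing H in this sketch makes the spread of the ordinary-IS estimate grow quickly, which is the qualitative behaviour the paper reports for look-ahead beyond a few time steps.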

Cite this Paper


BibTeX
@InProceedings{pmlr-v24-paduraru12a,
  title     = {An Empirical Analysis of Off-policy Learning in Discrete MDPs},
  author    = {Păduraru, Cosmin and Precup, Doina and Pineau, Joelle and Comănici, Gheorghe},
  booktitle = {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages     = {89--102},
  year      = {2013},
  editor    = {Deisenroth, Marc Peter and Szepesvári, Csaba and Peters, Jan},
  volume    = {24},
  series    = {Proceedings of Machine Learning Research},
  address   = {Edinburgh, Scotland},
  month     = {30 Jun--01 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf},
  url       = {https://proceedings.mlr.press/v24/paduraru12a.html}
}
APA
Păduraru, C., Precup, D., Pineau, J. & Comănici, G. (2013). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Proceedings of the Tenth European Workshop on Reinforcement Learning, in Proceedings of Machine Learning Research 24:89-102. Available from https://proceedings.mlr.press/v24/paduraru12a.html.
