Importance Sampling Policy Evaluation with an Estimated Behavior Policy

Josiah Hanna; Scott Niekum; Peter Stone

Proceedings of the 36th International Conference on Machine Learning, PMLR 97:2605-2613, 2019.

Abstract

We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.
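
As a rough illustration of the intuition in the abstract, the following Python sketch compares ordinary importance sampling using the true behavior policy against the same estimator using a count-based behavior policy estimated from the same data. It is a toy under stated assumptions (a single-state problem, discrete actions, deterministic rewards, and hypothetical names such as `estimate_behavior_policy` and `importance_sampling_estimate`), not the authors' implementation or experimental setup.

```python
from collections import Counter


def estimate_behavior_policy(trajectories):
    """Count-based maximum-likelihood estimate of a Markovian behavior policy.

    Assumes discrete states and actions. Each trajectory is a list of
    (state, action, reward) tuples. Returns pi_hat(action, state), the
    empirical frequency of `action` among visits to `state`.
    """
    state_action_counts = Counter()
    state_counts = Counter()
    for trajectory in trajectories:
        for state, action, _reward in trajectory:
            state_action_counts[(state, action)] += 1
            state_counts[state] += 1

    def pi_hat(action, state):
        return state_action_counts[(state, action)] / state_counts[state]

    return pi_hat


def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary (trajectory-wise) importance sampling estimate for pi_e."""
    total = 0.0
    for trajectory in trajectories:
        weight = 1.0  # product of per-step likelihood ratios
        ret = 0.0     # discounted return of this trajectory
        for t, (state, action, reward) in enumerate(trajectory):
            weight *= pi_e(action, state) / pi_b(action, state)
            ret += (gamma ** t) * reward
        total += weight * ret
    return total / len(trajectories)


# Hypothetical toy problem: one state "s", two actions, one step per episode.
def pi_e(action, state):
    """Evaluation policy: action 0 with probability 0.8, action 1 with 0.2."""
    return 0.8 if action == 0 else 0.2


def pi_b_true(action, state):
    """True behavior policy: uniform over the two actions."""
    return 0.5


# Action 0 yields reward 1, action 1 yields reward 0.
# By chance, action 0 was sampled 3 times out of 4.
data = [
    [("s", 0, 1.0)],
    [("s", 0, 1.0)],
    [("s", 0, 1.0)],
    [("s", 1, 0.0)],
]

pi_b_hat = estimate_behavior_policy(data)
is_true = importance_sampling_estimate(data, pi_e, pi_b_true)
is_estimated = importance_sampling_estimate(data, pi_e, pi_b_hat)
print(is_true, is_estimated)
```

In this toy data set action 0 happens to be over-sampled (3 of 4 draws rather than the expected 2), so weighting by the true policy gives roughly 1.2, while weighting by the empirical action frequencies gives roughly 0.8, which equals the evaluation policy's expected return here. The exact match is an artifact of the deterministic rewards in this example and should not be read as a general guarantee.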

Cite this Paper

BibTeX

@InProceedings{pmlr-v97-hanna19a,
  title = 	 {Importance Sampling Policy Evaluation with an Estimated Behavior Policy},
  author =       {Hanna, Josiah and Niekum, Scott and Stone, Peter},
  booktitle = 	 {Proceedings of the 36th International Conference on Machine Learning},
  pages = 	 {2605--2613},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume = 	 {97},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {09--15 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v97/hanna19a/hanna19a.pdf},
  url = 	 {https://proceedings.mlr.press/v97/hanna19a.html},
  abstract = 	 {We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.}
}

Endnote

%0 Conference Paper
%T Importance Sampling Policy Evaluation with an Estimated Behavior Policy
%A Josiah Hanna
%A Scott Niekum
%A Peter Stone
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-hanna19a
%I PMLR
%P 2605--2613
%U https://proceedings.mlr.press/v97/hanna19a.html
%V 97
%X We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian.

APA

Hanna, J., Niekum, S. & Stone, P. (2019). Importance Sampling Policy Evaluation with an Estimated Behavior Policy. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:2605-2613. Available from https://proceedings.mlr.press/v97/hanna19a.html.
