Optimal Off-Policy Evaluation from Multiple Logging Policies

Nathan Kallus, Yuta Saito, Masatoshi Uehara
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5247-5256, 2021.

Abstract

We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate that our methods efficiently leverage the stratified sampling of off-policy data from multiple loggers.
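
To ground the setup, the following is a minimal sketch of a doubly robust OPE estimate that pools data from multiple loggers, using the size-weighted average logging propensity as the importance-sampling denominator and a reward-model estimate q-hat as the control variate. This is an illustrative sketch of the general recipe under assumed conventions, not the paper's exact estimator; the names dr_multi_logger_ope, logger_props, pi_e, and q_hat are hypothetical, and a discrete-action contextual bandit is assumed.

def dr_multi_logger_ope(datasets, logger_props, pi_e, q_hat, actions):
    # datasets     : list of (X, A, R) triples, one per logging policy (stratum);
    #                X holds contexts, A actions, R rewards, each of length n_k.
    # logger_props : list of callables pi_k(a, x) -> logging propensity of logger k.
    # pi_e         : callable pi_e(a, x) -> propensity of the evaluation policy.
    # q_hat        : callable q_hat(x, a) -> estimate of E[R | x, a].
    # actions      : iterable of all discrete actions.
    n = sum(len(A) for _, A, _ in datasets)
    # Stratum proportions n_k / n, fixed in advance by the stratified design.
    weights = [len(A) / n for _, A, _ in datasets]

    def pi_bar(a, x):
        # Pooled logging propensity: the size-weighted mixture of all loggers.
        return sum(w * pi_k(a, x) for w, pi_k in zip(weights, logger_props))

    total = 0.0
    for X, A, R in datasets:
        for x, a, r in zip(X, A, R):
            # Direct (model) term: expected q_hat under the evaluation policy.
            direct = sum(pi_e(b, x) * q_hat(x, b) for b in actions)
            # IS correction with pooled weights; q_hat acts as a control variate.
            correction = pi_e(a, x) / pi_bar(a, x) * (r - q_hat(x, a))
            total += direct + correction
    return total / n

Per the abstract, pooling with the average propensity and plugging in consistent $q$-estimates is what attains the stratified-sampling efficiency bound; when the $q$-function may be misspecified, the paper instead chooses the control variate within a hypothesis class to minimize variance.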

Cite this Paper

BibTeX
@InProceedings{pmlr-v139-kallus21a,
  title     = {Optimal Off-Policy Evaluation from Multiple Logging Policies},
  author    = {Kallus, Nathan and Saito, Yuta and Uehara, Masatoshi},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5247--5256},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/kallus21a/kallus21a.pdf},
  url       = {https://proceedings.mlr.press/v139/kallus21a.html},
  abstract  = {We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate that our methods efficiently leverage the stratified sampling of off-policy data from multiple loggers.}
}
Endnote
%0 Conference Paper
%T Optimal Off-Policy Evaluation from Multiple Logging Policies
%A Nathan Kallus
%A Yuta Saito
%A Masatoshi Uehara
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-kallus21a
%I PMLR
%P 5247--5256
%U https://proceedings.mlr.press/v139/kallus21a.html
%V 139
%X We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which brings up a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate that our methods efficiently leverage the stratified sampling of off-policy data from multiple loggers.
APA
Kallus, N., Saito, Y. & Uehara, M. (2021). Optimal Off-Policy Evaluation from Multiple Logging Policies. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5247-5256. Available from https://proceedings.mlr.press/v139/kallus21a.html.
