Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization

Hang Wu, May Wang
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5353-5362, 2018.

Abstract

Off-policy learning, the task of evaluating and improving policies using historical data collected from a logging policy, is important because on-policy evaluation is usually expensive and can have adverse impacts. One of the major challenges of off-policy learning is deriving counterfactual estimators that also have low variance and thus low generalization error. In this work, inspired by learning bounds for importance sampling problems, we present a new counterfactual learning principle for off-policy learning with bandit feedback. Our method regularizes the generalization error by minimizing the distribution divergence between the logging policy and the new policy, removing the need to iterate through all training samples to compute the sample-variance regularizer used in prior work. With neural network policies, our end-to-end training algorithms using variational divergence minimization show significant improvement over conventional baseline algorithms and are also consistent with our theoretical results.
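As a rough illustration of the kind of objective the abstract describes (not the authors' actual algorithm), the sketch below combines a standard inverse-propensity-scoring (IPS) counterfactual risk estimate with a divergence penalty between the new policy and the logging policy. It is a minimal NumPy sketch under assumed names (softmax_policy, make_logged_data, regularized_ips_risk, lam are all illustrative), and it substitutes a simple plug-in KL estimate from importance weights for the paper's variational divergence estimator.

# Hypothetical sketch: IPS counterfactual risk plus a divergence regularizer,
# standing in for the variance-regularized objective described in the abstract.
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(theta, contexts):
    """Return action probabilities pi(a | x) for a linear softmax policy."""
    logits = contexts @ theta                      # shape (n, num_actions)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def make_logged_data(n=2000, d=5, num_actions=4):
    """Simulate logged bandit feedback (x, a, propensity, loss) from a logging policy pi_0."""
    contexts = rng.normal(size=(n, d))
    theta0 = rng.normal(size=(d, num_actions))     # logging-policy parameters
    p0 = softmax_policy(theta0, contexts)
    actions = np.array([rng.choice(num_actions, p=p) for p in p0])
    propensities = p0[np.arange(n), actions]
    losses = rng.normal(loc=actions / num_actions, scale=0.1)  # synthetic losses
    return contexts, actions, propensities, losses

def regularized_ips_risk(theta, contexts, actions, propensities, losses, lam=0.1):
    """IPS risk estimate plus a plug-in KL(pi || pi_0) penalty from importance weights."""
    probs = softmax_policy(theta, contexts)
    w = probs[np.arange(len(actions)), actions] / propensities  # importance weights
    ips_risk = np.mean(w * losses)
    # Since actions were drawn from pi_0, E[w log w] approximates KL(pi || pi_0).
    kl_plugin = np.mean(w * np.log(np.clip(w, 1e-12, None)))
    return ips_risk + lam * kl_plugin

contexts, actions, propensities, losses = make_logged_data()
theta_candidate = rng.normal(size=(contexts.shape[1], 4))
print("regularized risk:", regularized_ips_risk(theta_candidate, contexts, actions, propensities, losses))

In the paper's setting, the divergence term would instead be estimated with a neural discriminator via variational divergence minimization and optimized end-to-end together with the policy, rather than with the plug-in estimate used here.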

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-wu18g,
  title     = {Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization},
  author    = {Wu, Hang and Wang, May},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {5353--5362},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/wu18g/wu18g.pdf},
  url       = {https://proceedings.mlr.press/v80/wu18g.html},
  abstract  = {Off-policy learning, the task of evaluating and improving policies using historic data collected from a logging policy, is important because on-policy evaluation is usually expensive and has adverse impacts. One of the major challenge of off-policy learning is to derive counterfactual estimators that also has low variance and thus low generalization error. In this work, inspired by learning bounds for importance sampling problems, we present a new counterfactual learning principle for off-policy learning with bandit feedbacks. Our method regularizes the generalization error by minimizing the distribution divergence between the logging policy and the new policy, and removes the need for iterating through all training samples to compute sample variance regularization in prior work. With neural network policies, our end-to-end training algorithms using variational divergence minimization showed significant improvement over conventional baseline algorithms and is also consistent with our theoretical results.}
}
Endnote
%0 Conference Paper
%T Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization
%A Hang Wu
%A May Wang
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-wu18g
%I PMLR
%P 5353--5362
%U https://proceedings.mlr.press/v80/wu18g.html
%V 80
%X Off-policy learning, the task of evaluating and improving policies using historic data collected from a logging policy, is important because on-policy evaluation is usually expensive and has adverse impacts. One of the major challenge of off-policy learning is to derive counterfactual estimators that also has low variance and thus low generalization error. In this work, inspired by learning bounds for importance sampling problems, we present a new counterfactual learning principle for off-policy learning with bandit feedbacks. Our method regularizes the generalization error by minimizing the distribution divergence between the logging policy and the new policy, and removes the need for iterating through all training samples to compute sample variance regularization in prior work. With neural network policies, our end-to-end training algorithms using variational divergence minimization showed significant improvement over conventional baseline algorithms and is also consistent with our theoretical results.
APA
Wu, H. & Wang, M. (2018). Variance Regularized Counterfactual Risk Minimization via Variational Divergence Minimization. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5353-5362. Available from https://proceedings.mlr.press/v80/wu18g.html.
