CAB: Continuous Adaptive Blending for Policy Evaluation and Learning

Yi Su, Lequn Wang, Michele Santacatterina, Thorsten Joachims
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:6005-6014, 2019.

Abstract

The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy. In this paper, we identify a family of counterfactual estimators which subsumes most such estimators proposed to date. Our analysis of this family identifies a new estimator - called Continuous Adaptive Blending (CAB) - which enjoys many advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have less variance than Doubly Robust and IPS estimators. In addition, it is sub-differentiable such that it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.
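To make the blending idea concrete, the following is a minimal Python sketch of a CAB-style policy value estimate. The function name, variable names (pi_new, pi_log, reward_model_preds), and the specific blending weight min(1, M/w) are illustrative assumptions based on the description above, not a verbatim reproduction of the paper's estimator; consult the paper for the exact definition and its bias/variance analysis.

import numpy as np

def cab_estimate(rewards, pi_new, pi_log, reward_model_preds, M):
    """Sketch of a Continuous-Adaptive-Blending-style policy value estimate.

    rewards            : observed rewards delta_i for the logged actions
    pi_new, pi_log     : probabilities pi(y_i|x_i) and pi_0(y_i|x_i) of the logged
                         actions under the target and logging policies
    reward_model_preds : regression-model reward predictions for the logged
                         (x_i, y_i) pairs (the Direct-Method component)
    M                  : threshold controlling the bias/variance trade-off
    """
    w = pi_new / pi_log                   # importance weights
    alpha = np.minimum(1.0, M / w)        # continuous blending weight in [0, 1]
    # Blend the importance-weighted observed reward with the model prediction
    # per sample: alpha -> 1 recovers plain IPS, alpha -> 0 leans on the model.
    ips_term = alpha * w * rewards
    dm_term = (1.0 - alpha) * w * reward_model_preds
    return np.mean(ips_term + dm_term)

In this sketch, choosing M larger than every importance weight reduces the estimate to plain IPS, while small M shifts weight toward the reward model; because the blend is a continuous function of the policy probabilities, the estimate remains sub-differentiable and can be plugged into gradient-based off-policy learning, as the abstract notes.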

Cite this Paper


BibTeX
@InProceedings{pmlr-v97-su19a,
  title     = {{CAB}: Continuous Adaptive Blending for Policy Evaluation and Learning},
  author    = {Su, Yi and Wang, Lequn and Santacatterina, Michele and Joachims, Thorsten},
  booktitle = {Proceedings of the 36th International Conference on Machine Learning},
  pages     = {6005--6014},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Salakhutdinov, Ruslan},
  volume    = {97},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--15 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v97/su19a/su19a.pdf},
  url       = {https://proceedings.mlr.press/v97/su19a.html}
}
Endnote
%0 Conference Paper
%T CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
%A Yi Su
%A Lequn Wang
%A Michele Santacatterina
%A Thorsten Joachims
%B Proceedings of the 36th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Ruslan Salakhutdinov
%F pmlr-v97-su19a
%I PMLR
%P 6005--6014
%U https://proceedings.mlr.press/v97/su19a.html
%V 97
APA
Su, Y., Wang, L., Santacatterina, M. & Joachims, T. (2019). CAB: Continuous Adaptive Blending for Policy Evaluation and Learning. Proceedings of the 36th International Conference on Machine Learning, in Proceedings of Machine Learning Research 97:6005-6014. Available from https://proceedings.mlr.press/v97/su19a.html.