Bayesian Counterfactual Risk Minimization
Proceedings of the 36th International Conference on Machine Learning, PMLR 97:4125-4133, 2019.
Abstract
We present a Bayesian view of counterfactual risk minimization (CRM) for offline learning from logged bandit feedback. Using PAC-Bayesian analysis, we derive a new generalization bound for the truncated inverse propensity score estimator. We apply the bound to a class of Bayesian policies, which motivates a novel, potentially data-dependent, regularization technique for CRM. Experimental results indicate that this technique outperforms standard $L_2$ regularization, and that it is competitive with variance regularization while being both simpler to implement and more computationally efficient.
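The two ingredients the abstract names, a truncated inverse propensity score (IPS) estimate of risk and a PAC-Bayesian complexity term, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: it assumes logged feedback is a loss (lower is better) and that truncation raises each logged propensity to at least a level `tau`, which is one common form; the paper's exact definition may differ. The second function is the standard closed-form KL divergence between two Gaussians sharing an isotropic covariance, included only to show how a PAC-Bayesian term can reduce to a squared-distance (L2-style) penalty. Both function names are hypothetical.

```python
from typing import Sequence


def truncated_ips_risk(
    losses: Sequence[float],
    target_probs: Sequence[float],
    logging_probs: Sequence[float],
    tau: float,
) -> float:
    """Truncated IPS estimate of a policy's expected loss (illustrative sketch).

    losses[i]        -- observed loss for the logged action a_i in context x_i
    target_probs[i]  -- pi(a_i | x_i) under the policy being evaluated
    logging_probs[i] -- propensity p_i of a_i under the logging policy
    tau              -- truncation level in (0, 1]; propensities below tau
                        are raised to tau (assumed form of truncation)
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    n = len(losses)
    if n == 0 or not (n == len(target_probs) == len(logging_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for loss, pi, p in zip(losses, target_probs, logging_probs):
        # Truncating the propensity caps each importance weight at pi / tau,
        # trading some bias for bounded variance.
        total += loss * pi / max(p, tau)
    return total / n


def gaussian_kl_isotropic(
    mu: Sequence[float], mu0: Sequence[float], sigma: float
) -> float:
    """KL(N(mu, sigma^2 I) || N(mu0, sigma^2 I)) = ||mu - mu0||^2 / (2 sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if len(mu) != len(mu0):
        raise ValueError("mean vectors must have equal length")
    sq_dist = sum((m - m0) ** 2 for m, m0 in zip(mu, mu0))
    return sq_dist / (2.0 * sigma ** 2)


if __name__ == "__main__":
    losses = [1.0, 0.0, 1.0, 0.0]
    target = [0.5, 0.25, 0.75, 1.0]
    logging = [0.25, 0.5, 0.0625, 1.0]
    print(truncated_ips_risk(losses, target, logging, tau=0.125))  # 2.0
    print(gaussian_kl_isotropic([1.0, 2.0], [0.0, 0.0], sigma=1.0))  # 2.5
```

In the usage example, the third sample's propensity of 0.0625 is raised to 0.125, so its weight is 6 rather than 12. With a shared isotropic covariance the KL term is just a scaled squared distance between means, so a prior centered at the origin yields ordinary $L_2$ regularization, while a prior centered elsewhere shrinks toward that point instead.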


