Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

Adith Swaminathan; Thorsten Joachims

Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:814-823, 2015.

Abstract

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
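The abstract describes a propensity-weighted empirical risk estimator and bounds that account for its variance. The sketch below illustrates one way such a variance-penalized objective could be computed for a fixed candidate policy. It is a minimal illustration under assumptions, not the paper's POEM implementation: the function name `crm_objective`, the clipping constant `clip`, the penalty weight `lam`, and the exact form of the penalty (`lam * sqrt(variance / n)`) are not stated in the abstract and are chosen here for illustration only.

```python
import math


def crm_objective(losses, logged_propensities, new_policy_probs, clip=10.0, lam=1.0):
    """Variance-penalized, propensity-weighted risk estimate (illustrative sketch).

    losses              -- observed loss for each logged action (lower is better)
    logged_propensities -- probability the logging policy gave to the logged action
    new_policy_probs    -- probability the candidate policy gives to the same action
    clip, lam           -- hypothetical hyperparameters, assumed for this sketch
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two logged samples to estimate variance")

    # Importance-weighted loss per sample, with the weight clipped at `clip`
    # to limit the influence of very small logging propensities.
    terms = [
        loss * min(clip, q / p)
        for loss, p, q in zip(losses, logged_propensities, new_policy_probs)
    ]

    risk = sum(terms) / n                                    # empirical risk estimate
    variance = sum((t - risk) ** 2 for t in terms) / (n - 1)  # sample variance
    objective = risk + lam * math.sqrt(variance / n)          # variance-penalized risk
    return risk, variance, objective


# Toy logged data: the candidate policy matches the logging policy,
# so every importance weight equals 1.
risk, variance, objective = crm_objective(
    losses=[1.0, 0.0, 1.0, 0.0],
    logged_propensities=[0.5, 0.5, 0.5, 0.5],
    new_policy_probs=[0.5, 0.5, 0.5, 0.5],
)
print(risk, variance, objective)
```

A learner following this idea would compare or optimize candidate policies by this penalized value rather than by the raw risk estimate alone, so that a policy whose estimate depends on a few heavily weighted samples is treated with more caution.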

Cite this Paper

BibTeX


@InProceedings{pmlr-v37-swaminathan15,
  title = 	 {Counterfactual Risk Minimization: Learning from Logged Bandit Feedback},
  author = 	 {Swaminathan, Adith and Joachims, Thorsten},
  booktitle = 	 {Proceedings of the 32nd International Conference on Machine Learning},
  pages = 	 {814--823},
  year = 	 {2015},
  editor = 	 {Bach, Francis and Blei, David},
  volume = 	 {37},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Lille, France},
  month = 	 {07--09 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v37/swaminathan15.pdf},
  url = 	 {https://proceedings.mlr.press/v37/swaminathan15.html},
  abstract = 	 {We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.}
}

Endnote

%0 Conference Paper
%T Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
%A Adith Swaminathan
%A Thorsten Joachims
%B Proceedings of the 32nd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2015
%E Francis Bach
%E David Blei
%F pmlr-v37-swaminathan15
%I PMLR
%P 814--823
%U https://proceedings.mlr.press/v37/swaminathan15.html
%V 37
%X We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.

RIS


TY  - CPAPER
TI  - Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
AU  - Adith Swaminathan
AU  - Thorsten Joachims
BT  - Proceedings of the 32nd International Conference on Machine Learning
DA  - 2015/06/01
ED  - Francis Bach
ED  - David Blei
ID  - pmlr-v37-swaminathan15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 37
SP  - 814
EP  - 823
L1  - http://proceedings.mlr.press/v37/swaminathan15.pdf
UR  - https://proceedings.mlr.press/v37/swaminathan15.html
AB  - We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – called Policy Optimizer for Exponential Models (POEM) – for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
ER  -

APA


Swaminathan, A. & Joachims, T. (2015). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Proceedings of the 32nd International Conference on Machine Learning, in Proceedings of Machine Learning Research 37:814-823. Available from https://proceedings.mlr.press/v37/swaminathan15.html.
