Boosted Off-Policy Learning

Ben London; Levi Lu; Ted Sandler; Thorsten Joachims

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:5614-5640, 2023.

Abstract

We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy’s expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a “weak” learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-london23a,
  title     = {Boosted Off-Policy Learning},
  author    = {London, Ben and Lu, Levi and Sandler, Ted and Joachims, Thorsten},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {5614--5640},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/london23a/london23a.pdf},
  url       = {https://proceedings.mlr.press/v206/london23a.html},
  abstract = 	 {We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy’s expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a “weak” learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.}
}

Endnote

%0 Conference Paper
%T Boosted Off-Policy Learning
%A Ben London
%A Levi Lu
%A Ted Sandler
%A Thorsten Joachims
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-london23a
%I PMLR
%P 5614--5640
%U https://proceedings.mlr.press/v206/london23a.html
%V 206
%X We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy’s expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a “weak” learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply regress on the observed rewards.

APA


London, B., Lu, L., Sandler, T. & Joachims, T. (2023). Boosted Off-Policy Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:5614-5640. Available from https://proceedings.mlr.press/v206/london23a.html.
