Semiparametric Contextual Bandits

Akshay Krishnamurthy, Zhiwei Steven Wu, Vasilis Syrgkanis
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2776-2785, 2018.

Abstract

This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for a chosen action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenwald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-krishnamurthy18a, title = {Semiparametric Contextual Bandits}, author = {Krishnamurthy, Akshay and Wu, Zhiwei Steven and Syrgkanis, Vasilis}, booktitle = {Proceedings of the 35th International Conference on Machine Learning}, pages = {2776--2785}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80}, series = {Proceedings of Machine Learning Research}, month = {10--15 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v80/krishnamurthy18a/krishnamurthy18a.pdf}, url = {http://proceedings.mlr.press/v80/krishnamurthy18a.html}, abstract = {This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for a chosen action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenwald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.} }
Endnote
%0 Conference Paper %T Semiparametric Contextual Bandits %A Akshay Krishnamurthy %A Zhiwei Steven Wu %A Vasilis Syrgkanis %B Proceedings of the 35th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2018 %E Jennifer Dy %E Andreas Krause %F pmlr-v80-krishnamurthy18a %I PMLR %P 2776--2785 %U http://proceedings.mlr.press/v80/krishnamurthy18a.html %V 80 %X This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for a chosen action is modeled as a linear function of known action features confounded by a non-linear action-independent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$-dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenwald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are non-linear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doubly-robust approaches and our proofs require new concentration inequalities for self-normalized martingales.
APA
Krishnamurthy, A., Wu, Z.S. & Syrgkanis, V.. (2018). Semiparametric Contextual Bandits. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:2776-2785 Available from http://proceedings.mlr.press/v80/krishnamurthy18a.html.

Related Material