Semiparametric Contextual Bandits
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:27762785, 2018.
Abstract
This paper studies semiparametric contextual bandits, a generalization of the linear stochastic bandit problem where the reward for a chosen action is modeled as a linear function of known action features confounded by a nonlinear actionindependent term. We design new algorithms that achieve $\tilde{O}(d\sqrt{T})$ regret over $T$ rounds, when the linear function is $d$dimensional, which matches the best known bounds for the simpler unconfounded case and improves on a recent result of Greenwald et al. (2017). Via an empirical evaluation, we show that our algorithms outperform prior approaches when there are nonlinear confounding effects on the rewards. Technically, our algorithms use a new reward estimator inspired by doublyrobust approaches and our proofs require new concentration inequalities for selfnormalized martingales.
Related Material


