Contextual Bandit Algorithms with Supervised Learning Guarantees

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, Robert Schapire
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings 15:19-26, 2011.

Abstract

We address the problem of competing with any large set of N policies in the non-stochastic bandit setting, where the learner must repeatedly select among K actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of [Auer et al. 2002] called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT \ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically on a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC dimension d while incurring regret at most $O(\sqrt{Td \ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.
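To make the mechanics concrete, here is a minimal Python sketch of Exp4.P as described in the paper: each round mixes the weighted expert advice with a uniform exploration floor, forms importance-weighted reward estimates, and, unlike Exp4, adds a variance-control term to each expert's update, which is what yields the high-probability regret bound. This is an illustration distilled from the paper, not the authors' reference implementation; the callbacks `get_advice` and `get_reward` are hypothetical stand-ins, and the exploration floor is set roughly as $\sqrt{\ln N / (KT)}$.

```python
import numpy as np

def exp4p(T, K, N, get_advice, get_reward, delta=0.05, rng=None):
    """Sketch of Exp4.P for T rounds, K actions, N experts (policies).

    get_advice(t) -> (N, K) array; row i is expert i's distribution over actions.
    get_reward(t, a) -> observed reward in [0, 1] for the chosen action a.
    """
    rng = rng or np.random.default_rng()
    p_min = np.sqrt(np.log(N) / (K * T))         # uniform exploration floor
    assert K * p_min <= 1, "horizon T too short for this K and N"
    conf = np.sqrt(np.log(N / delta) / (K * T))  # confidence-width coefficient
    log_w = np.zeros(N)                          # log-space weights for numerical stability
    for t in range(T):
        xi = get_advice(t)                       # expert advice, shape (N, K)
        q = np.exp(log_w - log_w.max())
        q /= q.sum()                             # normalized expert weights
        p = (1 - K * p_min) * (q @ xi) + p_min   # action probabilities with exploration floor
        p /= p.sum()                             # guard against floating-point drift
        a = rng.choice(K, p=p)
        r = get_reward(t, a)
        r_hat = np.zeros(K)
        r_hat[a] = r / p[a]                      # importance-weighted reward estimate
        y_hat = xi @ r_hat                       # each expert's estimated gain
        v_hat = (xi / p).sum(axis=1)             # variance-control term for the estimate
        # Exp4.P update: estimated gain plus a variance bonus (Exp4 omits the bonus)
        log_w += (p_min / 2.0) * (y_hat + v_hat * conf)
    return np.exp(log_w - log_w.max())           # final (unnormalized) expert weights
```

The variance term `v_hat` is what the abstract alludes to: the importance-weighted estimates can have variance as large as 1/p_min, and compensating for it in the update is what upgrades Exp4's expected-regret guarantee to one that holds with high probability.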

Cite this Paper


BibTeX
@InProceedings{pmlr-v15-beygelzimer11a,
  title     = {Contextual Bandit Algorithms with Supervised Learning Guarantees},
  author    = {Alina Beygelzimer and John Langford and Lihong Li and Lev Reyzin and Robert Schapire},
  pages     = {19--26},
  year      = {2011},
  editor    = {Geoffrey Gordon and David Dunson and Miroslav Dudík},
  volume    = {15},
  series    = {Proceedings of Machine Learning Research},
  address   = {Fort Lauderdale, FL, USA},
  month     = {11--13 Apr},
  publisher = {JMLR Workshop and Conference Proceedings},
  pdf       = {http://proceedings.mlr.press/v15/beygelzimer11a/beygelzimer11a.pdf},
  url       = {http://proceedings.mlr.press/v15/beygelzimer11a.html},
  abstract  = {We address the problem of competing with any large set of N policies in the non-stochastic bandit setting, where the learner must repeatedly select among K actions but observes only the reward of the chosen action. We present a modification of the Exp4 algorithm of [Auer et al. 2002] called Exp4.P, which with high probability incurs regret at most $O(\sqrt{KT \ln N})$. Such a bound does not hold for Exp4 due to the large variance of the importance-weighted estimates used in the algorithm. The new algorithm is tested empirically on a large-scale, real-world dataset. For the stochastic version of the problem, we can use Exp4.P as a subroutine to compete with a possibly infinite set of policies of VC dimension d while incurring regret at most $O(\sqrt{Td \ln T})$ with high probability. These guarantees improve on those of all previous algorithms, whether in a stochastic or adversarial environment, and bring us closer to providing guarantees for this setting that are comparable to those in standard supervised learning.}
}
APA
Beygelzimer, A., Langford, J., Li, L., Reyzin, L. & Schapire, R. (2011). Contextual Bandit Algorithms with Supervised Learning Guarantees. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, in PMLR 15:19-26.