Taking a hint: How to leverage loss predictors in contextual bandits?

Chen-Yu Wei, Haipeng Luo, Alekh Agarwal
Proceedings of Thirty Third Conference on Learning Theory, PMLR 125:3583-3634, 2020.

Abstract

We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $\mathcal{E} \leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $\mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, in contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key \emph{action remapping} technique for optimal regret with known $\mathcal{E}$, 2) computationally efficient implementation of Catoni’s robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
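
As an aside for intuition: the Catoni robust mean estimator mentioned in technique 2) has a simple standalone form. It is defined as the root (in $\theta$) of $\sum_i \psi(\alpha(x_i - \theta)) = 0$, where $\psi$ is Catoni's logarithmically growing influence function. The sketch below is our own minimal Python illustration of that definition; the bisection solver and the particular tuning of alpha are assumptions made here for concreteness, and it does not reproduce the paper's actual contribution, which is implementing this estimator efficiently through an ERM oracle.

import math

def catoni_psi(x: float) -> float:
    """Catoni's influence function: grows only logarithmically,
    so a few gross outliers cannot drag the estimate far."""
    if x >= 0:
        return math.log1p(x + x * x / 2.0)
    return -math.log1p(-x + x * x / 2.0)

def catoni_mean(samples, alpha, tol=1e-9):
    """Root (in theta) of sum_i psi(alpha * (x_i - theta)) = 0,
    found by bisection: the sum is strictly decreasing in theta,
    nonnegative at theta = min(samples), nonpositive at theta = max(samples)."""
    lo, hi = min(samples), max(samples)
    def f(theta):
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid  # root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

# Illustration: one gross outlier among 100 well-behaved samples.
import random
random.seed(0)
data = [random.gauss(0.5, 0.1) for _ in range(100)] + [50.0]
n, delta, var_guess = len(data), 0.01, 0.1 ** 2
# One common tuning of alpha from a variance guess and failure
# probability delta; this choice is an assumption of this sketch.
alpha = math.sqrt(2 * math.log(1 / delta) / (n * var_guess))
print(sum(data) / n)             # ~0.99: the empirical mean is ruined
print(catoni_mean(data, alpha))  # ~0.5: close to the true mean

The point of the design is visible in psi: far from the current guess, a sample's pull on the estimate grows like log(x) rather than linearly, so the single outlier at 50.0 shifts the estimate only slightly, while the empirical mean is dragged to roughly 1.0.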

Cite this Paper


BibTeX
@InProceedings{pmlr-v125-wei20a,
  title     = {Taking a hint: How to leverage loss predictors in contextual bandits?},
  author    = {Wei, Chen-Yu and Luo, Haipeng and Agarwal, Alekh},
  booktitle = {Proceedings of Thirty Third Conference on Learning Theory},
  pages     = {3583--3634},
  year      = {2020},
  editor    = {Abernethy, Jacob and Agarwal, Shivani},
  volume    = {125},
  series    = {Proceedings of Machine Learning Research},
  month     = {09--12 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v125/wei20a/wei20a.pdf},
  url       = {https://proceedings.mlr.press/v125/wei20a.html},
  abstract  = {We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $\mathcal{E} \leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $\mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, in contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key \emph{action remapping} technique for optimal regret with known $\mathcal{E}$, 2) computationally efficient implementation of Catoni’s robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.}
}
Endnote
%0 Conference Paper
%T Taking a hint: How to leverage loss predictors in contextual bandits?
%A Chen-Yu Wei
%A Haipeng Luo
%A Alekh Agarwal
%B Proceedings of Thirty Third Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2020
%E Jacob Abernethy
%E Shivani Agarwal
%F pmlr-v125-wei20a
%I PMLR
%P 3583--3634
%U https://proceedings.mlr.press/v125/wei20a.html
%V 125
%X We initiate the study of learning in contextual bandits with the help of loss predictors. The main question we address is whether one can improve over the minimax regret $\mathcal{O}(\sqrt{T})$ for learning over $T$ rounds, when the total error of the predicted losses relative to the realized losses, denoted as $\mathcal{E} \leq T$, is relatively small. We provide a complete answer to this question, with upper and lower bounds for various settings: adversarial and stochastic environments, known and unknown $\mathcal{E}$, and single and multiple predictors. We show several surprising results, such as 1) the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}T^\frac{1}{4}\})$ when $\mathcal{E}$ is known, in contrast to the standard and better bound $\mathcal{O}(\sqrt{\mathcal{E}})$ for non-contextual problems (such as multi-armed bandits); 2) the same bound cannot be achieved if $\mathcal{E}$ is unknown, but as a remedy, $\mathcal{O}(\sqrt{\mathcal{E}}T^\frac{1}{3})$ is achievable; 3) with $M$ predictors, a linear dependence on $M$ is necessary, even though logarithmic dependence is possible for non-contextual problems. We also develop several novel algorithmic techniques to achieve matching upper bounds, including 1) a key \emph{action remapping} technique for optimal regret with known $\mathcal{E}$, 2) computationally efficient implementation of Catoni’s robust mean estimator via an ERM oracle in the stochastic setting with optimal regret, 3) an underestimator for $\mathcal{E}$ via estimating the histogram with bins of exponentially increasing size for the stochastic setting with unknown $\mathcal{E}$, and 4) a self-referential scheme for learning with multiple predictors, all of which might be of independent interest.
APA
Wei, C., Luo, H. & Agarwal, A. (2020). Taking a hint: How to leverage loss predictors in contextual bandits? Proceedings of Thirty Third Conference on Learning Theory, in Proceedings of Machine Learning Research 125:3583-3634. Available from https://proceedings.mlr.press/v125/wei20a.html.