Nonstochastic Bandits with Composite Anonymous Feedback

Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour
Proceedings of the 31st Conference On Learning Theory, PMLR 75:750-773, 2018.

Abstract

We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
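To make the feedback model concrete, below is a minimal simulation sketch (in Python; it is not from the paper). The number of actions K, the horizon T, the random split of each loss into d components, and the uniformly random player are illustrative assumptions: the point is only that the loss generated at round t is spread over rounds t, ..., t+d-1, and the player observes just the per-round sum, never the individual delayed components.

import random

# Sketch of composite anonymous feedback: the loss of the action played at
# round t is split by the adversary into at most d nonnegative components,
# charged over rounds t, t+1, ..., t+d-1. At the end of each round the player
# observes only the SUM of all components falling due at that round.

K = 3   # number of actions (assumed for illustration)
d = 4   # maximum spread/delay of each loss
T = 20  # horizon

rng = random.Random(0)
pending = [0.0] * (T + d)  # pending[s] = sum of loss components due at round s

for t in range(T):
    action = rng.randrange(K)  # stand-in for any bandit strategy

    # Adversary assigns a total loss in [0,1] to each action for this round
    # (oblivious in the paper; random here for concreteness) ...
    losses = [rng.random() for _ in range(K)]
    total_loss = losses[action]

    # ... and spreads the played action's loss arbitrarily over the next d
    # rounds (here: a random split into d fractions summing to 1).
    cuts = sorted(rng.random() for _ in range(d - 1))
    components = [a - b for a, b in zip(cuts + [1.0], [0.0] + cuts)]
    for s, frac in enumerate(components):
        pending[t + s] += total_loss * frac

    # Only the aggregate of all components due now is revealed to the player.
    observed = pending[t]
    print(f"round {t:2d}: played {action}, observed composite loss {observed:.3f}")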

Cite this Paper


BibTeX
@InProceedings{pmlr-v75-cesa-bianchi18a,
  title = {Nonstochastic Bandits with Composite Anonymous Feedback},
  author = {Cesa-Bianchi, Nicol\`o and Gentile, Claudio and Mansour, Yishay},
  booktitle = {Proceedings of the 31st Conference On Learning Theory},
  pages = {750--773},
  year = {2018},
  editor = {Bubeck, Sébastien and Perchet, Vianney and Rigollet, Philippe},
  volume = {75},
  series = {Proceedings of Machine Learning Research},
  month = {06--09 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v75/cesa-bianchi18a/cesa-bianchi18a.pdf},
  url = {https://proceedings.mlr.press/v75/cesa-bianchi18a.html},
  abstract = {We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.}
}
Endnote
%0 Conference Paper
%T Nonstochastic Bandits with Composite Anonymous Feedback
%A Nicolò Cesa-Bianchi
%A Claudio Gentile
%A Yishay Mansour
%B Proceedings of the 31st Conference On Learning Theory
%C Proceedings of Machine Learning Research
%D 2018
%E Sébastien Bubeck
%E Vianney Perchet
%E Philippe Rigollet
%F pmlr-v75-cesa-bianchi18a
%I PMLR
%P 750--773
%U https://proceedings.mlr.press/v75/cesa-bianchi18a.html
%V 75
%X We investigate a nonstochastic bandit setting in which the loss of an action is not immediately charged to the player, but rather spread over at most d consecutive steps in an adversarial way. This implies that the instantaneous loss observed by the player at the end of each round is a sum of as many as d loss components of previously played actions. Hence, unlike the standard bandit setting with delayed feedback, here the player cannot observe the individual delayed losses, but only their sum. Our main contribution is a general reduction transforming a standard bandit algorithm into one that can operate in this harder setting. We also show how the regret of the transformed algorithm can be bounded in terms of the regret of the original algorithm. Our reduction cannot be improved in general: we prove a lower bound on the regret of any bandit algorithm in this setting that matches (up to log factors) the upper bound obtained via our reduction. Finally, we show how our reduction can be extended to more complex bandit settings, such as combinatorial linear bandits and online bandit convex optimization.
APA
Cesa-Bianchi, N., Gentile, C. & Mansour, Y. (2018). Nonstochastic Bandits with Composite Anonymous Feedback. Proceedings of the 31st Conference On Learning Theory, in Proceedings of Machine Learning Research 75:750-773. Available from https://proceedings.mlr.press/v75/cesa-bianchi18a.html.