First-order regret bounds for combinatorial semi-bandits

Gergely Neu
Proceedings of The 28th Conference on Learning Theory, PMLR 40:1360-1375, 2015.

Abstract

We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as $\widetilde{O}(\sqrt{T})$ with the number of rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\widetilde{O}(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
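For concreteness, here is a minimal sketch of the quantities involved, using the standard online combinatorial optimization notation (the decision set $\mathcal{S} \subseteq \{0,1\}^d$, the loss vectors $\ell_t \in [0,1]^d$, and the maximal action size $m$ are assumptions, as this page does not define them). In each round $t$ the learner picks an action $V_t \in \mathcal{S}$, suffers loss $V_t^\top \ell_t$, and observes only the coordinates $\ell_{t,i}$ with $V_{t,i} = 1$ (semi-bandit feedback). The regret and the comparator loss $L_T^*$ are then

    % regret against the best fixed action in hindsight
    R_T = \mathbb{E}\left[\sum_{t=1}^{T} V_t^\top \ell_t\right] - L_T^*,
    \qquad
    L_T^* = \min_{v \in \mathcal{S}} \sum_{t=1}^{T} v^\top \ell_t .

Under these assumptions $L_T^* \le mT$, so a $\widetilde{O}(\sqrt{L_T^*})$ guarantee is never worse than the $\widetilde{O}(\sqrt{T})$ rate, and it can be substantially tighter whenever the best action incurs small total loss.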

Cite this Paper


BibTeX
@InProceedings{pmlr-v40-Neu15,
  title     = {First-order regret bounds for combinatorial semi-bandits},
  author    = {Neu, Gergely},
  booktitle = {Proceedings of The 28th Conference on Learning Theory},
  pages     = {1360--1375},
  year      = {2015},
  editor    = {Grünwald, Peter and Hazan, Elad and Kale, Satyen},
  volume    = {40},
  series    = {Proceedings of Machine Learning Research},
  address   = {Paris, France},
  month     = {03--06 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v40/Neu15.pdf},
  url       = {https://proceedings.mlr.press/v40/Neu15.html},
  abstract  = {We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as $\widetilde{O}(\sqrt{T})$ with the number of rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\widetilde{O}(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.}
}
Endnote
%0 Conference Paper
%T First-order regret bounds for combinatorial semi-bandits
%A Gergely Neu
%B Proceedings of The 28th Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2015
%E Peter Grünwald
%E Elad Hazan
%E Satyen Kale
%F pmlr-v40-Neu15
%I PMLR
%P 1360--1375
%U https://proceedings.mlr.press/v40/Neu15.html
%V 40
%X We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as $\widetilde{O}(\sqrt{T})$ with the number of rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\widetilde{O}(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
RIS
TY  - CPAPER
TI  - First-order regret bounds for combinatorial semi-bandits
AU  - Gergely Neu
BT  - Proceedings of The 28th Conference on Learning Theory
DA  - 2015/06/26
ED  - Peter Grünwald
ED  - Elad Hazan
ED  - Satyen Kale
ID  - pmlr-v40-Neu15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 40
SP  - 1360
EP  - 1375
L1  - http://proceedings.mlr.press/v40/Neu15.pdf
UR  - https://proceedings.mlr.press/v40/Neu15.html
AB  - We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as $\widetilde{O}(\sqrt{T})$ with the number of rounds $T$. In this paper, we propose an algorithm that improves this scaling to $\widetilde{O}(\sqrt{L_T^*})$, where $L_T^*$ is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
ER  -
APA
Neu, G. (2015). First-order regret bounds for combinatorial semi-bandits. Proceedings of The 28th Conference on Learning Theory, in Proceedings of Machine Learning Research 40:1360-1375. Available from https://proceedings.mlr.press/v40/Neu15.html.
