First-order regret bounds for combinatorial semi-bandits

Gergely Neu

First-order regret bounds for combinatorial semi-bandits

Gergely Neu

Proceedings of The 28th Conference on Learning Theory, PMLR 40:1360-1375, 2015.

Abstract

We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as \widetildeO(\sqrtT) with the number of rounds T. In this paper, we propose an algorithm that improves this scaling to \widetildeO(\sqrtL_T^*), where L_T^* is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.

Cite this Paper

BibTeX


@InProceedings{pmlr-v40-Neu15,
  title = 	 {First-order regret bounds for combinatorial semi-bandits},
  author = 	 {Neu, Gergely},
  booktitle = 	 {Proceedings of The 28th Conference on Learning Theory},
  pages = 	 {1360--1375},
  year = 	 {2015},
  editor = 	 {Grünwald, Peter and Hazan, Elad and Kale, Satyen},
  volume = 	 {40},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Paris, France},
  month = 	 {03--06 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v40/Neu15.pdf},
  url = 	 {https://proceedings.mlr.press/v40/Neu15.html},
  abstract = 	 {We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as \widetildeO(\sqrtT) with the number of rounds T. In this paper, we propose an algorithm that improves this scaling to \widetildeO(\sqrtL_T^*), where L_T^* is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.}
}

Endnote

%0 Conference Paper
%T First-order regret bounds for combinatorial semi-bandits
%A Gergely Neu
%B Proceedings of The 28th Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2015
%E Peter Grünwald
%E Elad Hazan
%E Satyen Kale	
%F pmlr-v40-Neu15
%I PMLR
%P 1360--1375
%U https://proceedings.mlr.press/v40/Neu15.html
%V 40
%X We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as \widetildeO(\sqrtT) with the number of rounds T. In this paper, we propose an algorithm that improves this scaling to \widetildeO(\sqrtL_T^*), where L_T^* is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.

RIS


TY  - CPAPER
TI  - First-order regret bounds for combinatorial semi-bandits
AU  - Gergely Neu
BT  - Proceedings of The 28th Conference on Learning Theory
DA  - 2015/06/26
ED  - Peter Grünwald
ED  - Elad Hazan
ED  - Satyen Kale	
ID  - pmlr-v40-Neu15
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 40
SP  - 1360
EP  - 1375
L1  - http://proceedings.mlr.press/v40/Neu15.pdf
UR  - https://proceedings.mlr.press/v40/Neu15.html
AB  - We consider the problem of online combinatorial optimization under semi-bandit feedback, where a learner has to repeatedly pick actions from a combinatorial decision set in order to minimize the total losses associated with its decisions. After making each decision, the learner observes the losses associated with its action, but not other losses. For this problem, there are several learning algorithms that guarantee that the learner’s expected regret grows as \widetildeO(\sqrtT) with the number of rounds T. In this paper, we propose an algorithm that improves this scaling to \widetildeO(\sqrtL_T^*), where L_T^* is the total loss of the best action. Our algorithm is among the first to achieve such guarantees in a partial-feedback scheme, and the first one to do so in a combinatorial setting.
ER  -

APA


Neu, G.. (2015). First-order regret bounds for combinatorial semi-bandits. Proceedings of The 28th Conference on Learning Theory, in Proceedings of Machine Learning Research 40:1360-1375 Available from https://proceedings.mlr.press/v40/Neu15.html.

Related Material

Download PDF