Corralling a Band of Bandit Algorithms

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, Robert E. Schapire
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:12-38, 2017.

Abstract

We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback, and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works, yielding superior regret bounds. Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.
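
To make the abstract's idea concrete, below is a minimal, illustrative Python sketch of a "corralling" master: Online Mirror Descent over the base algorithms with a log-barrier-style mirror map, importance-weighted feedback for the sampled base algorithm, and per-algorithm learning rates that are increased when an algorithm's probability becomes very small so it is not permanently starved. This is not the paper's exact Corral algorithm; the class name CorralSketch, the parameter defaults, the threshold/doubling rule, and the stubbed demo are assumptions made purely for illustration.

import numpy as np


class CorralSketch:
    """Toy master that 'corrals' M base bandit algorithms (illustrative sketch only)."""

    def __init__(self, num_bases, horizon, eta=0.1, beta=np.e):
        self.M = num_bases
        self.p = np.full(num_bases, 1.0 / num_bases)    # distribution over base algorithms
        self.eta = np.full(num_bases, eta)              # per-base learning rates (assumed default)
        self.rho = np.full(num_bases, 2.0 * num_bases)  # thresholds on 1 / p_i (assumed rule)
        self.beta = beta                                # learning-rate growth factor
        self.gamma = 1.0 / horizon                      # uniform smoothing amount

    def sample(self, rng):
        # Mix in a little uniform exploration so no base algorithm is sampled with zero probability.
        self.q = (1.0 - self.gamma) * self.p + self.gamma / self.M
        self.i = rng.choice(self.M, p=self.q)
        return self.i  # index of the base algorithm that acts this round

    def update(self, loss):
        # Importance-weighted loss estimate: only the sampled base has a nonzero entry,
        # scaled by 1 / (sampling probability) so the estimate is unbiased.
        loss_hat = np.zeros(self.M)
        loss_hat[self.i] = loss / self.q[self.i]
        self.p = self._log_barrier_omd(self.p, loss_hat)
        # If a base algorithm's probability has become very small, raise its learning rate
        # so it can recover quickly if it later turns out to be the best one.
        for j in range(self.M):
            if 1.0 / self.p[j] > self.rho[j]:
                self.rho[j] = 2.0 / self.p[j]
                self.eta[j] *= self.beta

    def _log_barrier_omd(self, p, loss_hat, iters=60):
        # Online Mirror Descent step with mirror map sum_i (1 / eta_i) * log(1 / p_i).
        # The update satisfies 1 / p_new_i = 1 / p_i + eta_i * (loss_hat_i - lam),
        # with the normalizer lam found by bisection so that p_new sums to one.
        lo = np.min(loss_hat)                           # at lam = lo the candidate sums to <= 1
        hi = np.min(loss_hat + 1.0 / (self.eta * p))    # below hi all denominators stay positive
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            q = 1.0 / (1.0 / p + self.eta * (loss_hat - lam))
            if q.sum() < 1.0:
                lo = lam
            else:
                hi = lam
        q = 1.0 / (1.0 / p + self.eta * (loss_hat - lo))
        return q / q.sum()                              # renormalize for numerical safety


if __name__ == "__main__":
    # Tiny self-contained demo: two "base algorithms" are stubbed as fixed Bernoulli loss
    # streams; the master should shift its sampling mass toward the lower-loss one.
    rng = np.random.default_rng(0)
    mean_losses = [0.7, 0.3]
    master = CorralSketch(num_bases=2, horizon=5000)
    for _ in range(5000):
        i = master.sample(rng)
        loss = float(rng.random() < mean_losses[i])
        master.update(loss)
    print("final sampling distribution:", master.p)

The log-barrier mirror map is what lets the learning rate of a neglected base algorithm be raised without destabilizing the update, which is the sketch's stand-in for the delicate exploit/explore balance over base algorithms described above; the specific thresholds and constants here are placeholders, not the paper's.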

Cite this Paper


BibTeX
@InProceedings{pmlr-v65-agarwal17b,
  title     = {Corralling a Band of Bandit Algorithms},
  author    = {Agarwal, Alekh and Luo, Haipeng and Neyshabur, Behnam and Schapire, Robert E.},
  booktitle = {Proceedings of the 2017 Conference on Learning Theory},
  pages     = {12--38},
  year      = {2017},
  editor    = {Kale, Satyen and Shamir, Ohad},
  volume    = {65},
  series    = {Proceedings of Machine Learning Research},
  month     = {07--10 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v65/agarwal17b/agarwal17b.pdf},
  url       = {https://proceedings.mlr.press/v65/agarwal17b.html},
  abstract  = {We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm \it if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works yielding superior regret bounds. Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.}
}
Endnote
%0 Conference Paper
%T Corralling a Band of Bandit Algorithms
%A Alekh Agarwal
%A Haipeng Luo
%A Behnam Neyshabur
%A Robert E. Schapire
%B Proceedings of the 2017 Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2017
%E Satyen Kale
%E Ohad Shamir
%F pmlr-v65-agarwal17b
%I PMLR
%P 12--38
%U https://proceedings.mlr.press/v65/agarwal17b.html
%V 65
%X We study the problem of combining multiple bandit algorithms (that is, online learning algorithms with partial feedback) with the goal of creating a master algorithm that performs almost as well as the best base algorithm \it if it were to be run on its own. The main challenge is that when run with a master, base algorithms unavoidably receive much less feedback and it is thus critical that the master not starve a base algorithm that might perform uncompetitively initially but would eventually outperform others if given enough feedback. We address this difficulty by devising a version of Online Mirror Descent with a special mirror map together with a sophisticated learning rate scheme. We show that this approach manages to achieve a more delicate balance between exploiting and exploring base algorithms than previous works yielding superior regret bounds. Our results are applicable to many settings, such as multi-armed bandits, contextual bandits, and convex bandits. As examples, we present two main applications. The first is to create an algorithm that enjoys worst-case robustness while at the same time performing much better when the environment is relatively easy. The second is to create an algorithm that works simultaneously under different assumptions of the environment, such as different priors or different loss structures.
APA
Agarwal, A., Luo, H., Neyshabur, B. & Schapire, R.E. (2017). Corralling a Band of Bandit Algorithms. Proceedings of the 2017 Conference on Learning Theory, in Proceedings of Machine Learning Research 65:12-38. Available from https://proceedings.mlr.press/v65/agarwal17b.html.
