Experimental Design for Regret Minimization in Linear Bandits

Andrew Wagenmaker, Julian Katz-Samuels, Kevin Jamieson
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:3088-3096, 2021.

Abstract

In this paper we propose a novel experimental design-based algorithm to minimize regret in online stochastic linear and combinatorial bandits. While existing literature tends to focus on optimism-based algorithms, which have been shown to be suboptimal in many cases, our approach carefully plans which action to take by balancing the tradeoff between information gain and reward, overcoming the failures of optimism. In addition, we leverage tools from the theory of suprema of empirical processes to obtain regret guarantees that scale with the Gaussian width of the action set, avoiding wasteful union bounds. We provide state-of-the-art finite-time regret guarantees and show that our algorithm can be applied in both the bandit and semi-bandit feedback regimes. In the combinatorial semi-bandit setting, we show that our algorithm is computationally efficient and relies only on calls to a linear maximization oracle. In addition, we show that with slight modification our algorithm can be used for pure exploration, obtaining state-of-the-art pure exploration guarantees in the semi-bandit setting. Finally, we provide, to the best of our knowledge, the first example where optimism fails in the semi-bandit regime, and show that in this setting our algorithm succeeds.
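For reference, the Gaussian width of an action set A in R^d mentioned above is w(A) = E_{g ~ N(0, I)} [ sup_{a in A} <g, a> ], a measure of the effective size of A that can be far smaller than what a union bound over all arms would give. To make the experimental-design viewpoint concrete, below is a minimal Python sketch of the classical template this line of work builds on: phased elimination with a (near-)G-optimal design. It is illustrative only, not the authors' algorithm (which additionally weighs information gain against the regret incurred); every function name, constant, and the reward callback pull are assumptions of the sketch.

import numpy as np

def g_optimal_design(A, iters=200):
    """Frank-Wolfe approximation of a G-optimal design over the rows of A (sketch)."""
    n, d = A.shape
    pi = np.full(n, 1.0 / n)                        # start from the uniform design
    for _ in range(iters):
        V = A.T @ (pi[:, None] * A)                 # information matrix V(pi)
        Vinv = np.linalg.pinv(V)
        g = np.einsum('ij,jk,ik->i', A, Vinv, A)    # leverage scores a^T V^-1 a
        j = int(np.argmax(g))                       # most informative arm
        step = (g[j] / d - 1.0) / max(g[j] - 1.0, 1e-12)  # closed-form line search
        pi = (1.0 - step) * pi
        pi[j] += step
    return pi

def phased_elimination(A, pull, T, delta=0.1):
    """Pull arms according to the design, then drop empirically suboptimal arms."""
    d = A.shape[1]
    active = np.arange(len(A))
    eps, t = 1.0, 0
    while t < T:
        Aa = A[active]
        pi = g_optimal_design(Aa)
        # illustrative per-phase allocation; constants are assumptions
        counts = np.ceil(2.0 * d * pi / eps**2 * np.log(len(Aa) / delta)).astype(int)
        V, b = np.zeros((d, d)), np.zeros(d)
        for a, n_a in zip(Aa, counts):
            for _ in range(n_a):
                if t >= T:
                    break
                r = pull(a)                         # noisy reward for arm a
                t += 1
                V += np.outer(a, a)
                b += r * a
        theta = np.linalg.pinv(V) @ b               # least-squares estimate
        mu = Aa @ theta
        active = active[mu >= mu.max() - 2.0 * eps]  # keep plausibly optimal arms
        eps /= 2.0                                  # halve the accuracy target
    return active

Here pull(a) is assumed to return a noisy observation of <theta_*, a>. The Frank-Wolfe loop approximately maximizes log det V(pi); by the Kiefer-Wolfowitz theorem the optimal design drives the worst-case prediction variance max_a a^T V(pi)^{-1} a down to d, which is why the per-phase sample counts above scale with the dimension rather than the number of arms.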

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-wagenmaker21a,
  title     = {Experimental Design for Regret Minimization in Linear Bandits},
  author    = {Wagenmaker, Andrew and Katz-Samuels, Julian and Jamieson, Kevin},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {3088--3096},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/wagenmaker21a/wagenmaker21a.pdf},
  url       = {https://proceedings.mlr.press/v130/wagenmaker21a.html}
}
APA
Wagenmaker, A., Katz-Samuels, J. & Jamieson, K. (2021). Experimental Design for Regret Minimization in Linear Bandits. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:3088-3096. Available from https://proceedings.mlr.press/v130/wagenmaker21a.html.
