Contextual Blocking Bandits

Soumya Basu, Orestis Papadigenopoulos, Constantine Caramanis, Sanjay Shakkottai
Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, PMLR 130:271-279, 2021.

Abstract

We study a novel variant of the multi-armed bandit problem, where at each time step the player observes an independently sampled context that determines the arms’ mean rewards. However, playing an arm blocks it (across all contexts) for a fixed number of future time steps. This contextual setting captures important scenarios such as recommendation systems or ad placement with diverse users. The problem has recently been studied [Dickerson et al., AAAI 2018] in the full-information setting (i.e., assuming knowledge of the mean context-dependent arm rewards), where competitive ratio bounds have been derived. We focus on the bandit setting, where these means are initially unknown; we propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$ regret w.r.t. an $\alpha$-optimal strategy in $T$ time steps, matching the $\Omega(\log T)$ regret lower bound in this setting. Due to the time correlations caused by blocking, existing techniques for upper bounding regret fail. To prove our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling and combine them with ideas from combinatorial bandits and the coupling of non-stationary Markov chains.
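To make the setting concrete, below is a minimal, self-contained Python sketch of a contextual blocking bandit simulation: a context is drawn i.i.d. each round, pulling an arm makes it unavailable for a fixed number of subsequent rounds, and a simple per-context UCB index is maximized over the arms that are currently available. The arm count, delays, Bernoulli rewards, and the greedy UCB rule are illustrative assumptions; this is not the paper's algorithm, which additionally relies on delayed exploitation and opportunistic subsampling to obtain its $\mathcal{O}(\log T)$ guarantee against an $\alpha$-optimal strategy.

# Illustrative sketch of the contextual blocking bandit setting (hypothetical
# parameters, NOT the paper's algorithm): contexts arrive i.i.d., and pulling
# an arm blocks it for a fixed number of future rounds across all contexts.
import math
import random

K = 4                                  # number of arms (assumed for illustration)
C = 3                                  # number of contexts
D = [2, 3, 1, 4]                       # blocking delay of each arm (hypothetical)
T = 10_000                             # horizon
context_probs = [0.5, 0.3, 0.2]        # context distribution
true_means = [[random.random() for _ in range(K)] for _ in range(C)]

counts = [[0] * K for _ in range(C)]   # pulls of arm k under context c
sums = [[0.0] * K for _ in range(C)]   # cumulative reward of arm k under context c
blocked_until = [0] * K                # first round at which each arm is free again
total_reward = 0.0

for t in range(1, T + 1):
    c = random.choices(range(C), weights=context_probs)[0]
    available = [k for k in range(K) if blocked_until[k] <= t]
    if not available:                  # every arm blocked: no pull this round
        continue

    def ucb(k):
        if counts[c][k] == 0:
            return float("inf")        # force initial exploration of each pair
        mean = sums[c][k] / counts[c][k]
        return mean + math.sqrt(2.0 * math.log(t) / counts[c][k])

    k = max(available, key=ucb)
    reward = float(random.random() < true_means[c][k])   # Bernoulli reward
    counts[c][k] += 1
    sums[c][k] += reward
    blocked_until[k] = t + D[k]        # arm k cannot be played again before round t + D[k]
    total_reward += reward

print(f"average reward: {total_reward / T:.3f}")

The blocked_until array is what distinguishes this loop from a standard contextual bandit: an arm pulled at round t cannot be chosen again before round t + D[k], which introduces exactly the time correlations that the paper's analysis has to handle.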

Cite this Paper


BibTeX
@InProceedings{pmlr-v130-basu21a,
  title     = {Contextual Blocking Bandits},
  author    = {Basu, Soumya and Papadigenopoulos, Orestis and Caramanis, Constantine and Shakkottai, Sanjay},
  booktitle = {Proceedings of The 24th International Conference on Artificial Intelligence and Statistics},
  pages     = {271--279},
  year      = {2021},
  editor    = {Banerjee, Arindam and Fukumizu, Kenji},
  volume    = {130},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--15 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v130/basu21a/basu21a.pdf},
  url       = {https://proceedings.mlr.press/v130/basu21a.html}
}
Endnote
%0 Conference Paper
%T Contextual Blocking Bandits
%A Soumya Basu
%A Orestis Papadigenopoulos
%A Constantine Caramanis
%A Sanjay Shakkottai
%B Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2021
%E Arindam Banerjee
%E Kenji Fukumizu
%F pmlr-v130-basu21a
%I PMLR
%P 271--279
%U https://proceedings.mlr.press/v130/basu21a.html
%V 130
APA
Basu, S., Papadigenopoulos, O., Caramanis, C. & Shakkottai, S. (2021). Contextual Blocking Bandits. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 130:271-279. Available from https://proceedings.mlr.press/v130/basu21a.html.
