Complexity Analysis of a Countable-armed Bandit Problem

Anand Kalvit, Assaf Zeevi
Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:850-890, 2023.

Abstract

We consider a stochastic multi-armed bandit (MAB) problem motivated by “large” action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.
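The setting described above can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it runs plain UCB1 over a fixed pre-sample of arms drawn from a two-type population ($K=2$), where each arm is independently of the optimal type with some probability and the learner observes rewards but never an arm's type. All numeric values (`type_means`, `p_optimal`, `num_sampled`) are illustrative assumptions.

```python
import math
import random

def simulate(n=2000, type_means=(0.5, 0.7), p_optimal=0.3,
             num_sampled=8, seed=0):
    """Countable-armed bandit sketch with K=2 arm-types: each sampled arm
    is independently of the optimal type with probability p_optimal; the
    learner observes Bernoulli rewards but never an arm's type. Strategy
    is plain UCB1 over the pre-sample (illustration only, not the paper's
    algorithm). Returns the cumulative pseudo-regret."""
    rng = random.Random(seed)
    # Hidden type (mean reward) of each sampled arm -- unknown to the learner.
    means = [type_means[1] if rng.random() < p_optimal else type_means[0]
             for _ in range(num_sampled)]
    counts = [0] * num_sampled
    sums = [0.0] * num_sampled
    best = max(type_means)
    pseudo_regret = 0.0
    for t in range(1, n + 1):
        if t <= num_sampled:          # initialization: play each arm once
            a = t - 1
        else:                         # pick the arm with the largest UCB1 index
            a = max(range(num_sampled),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[a] else 0.0  # Bernoulli pull
        counts[a] += 1
        sums[a] += reward
        pseudo_regret += best - means[a]  # gap paid for a suboptimal-type pull
    return pseudo_regret
```

If the pre-sample happens to contain at least one optimal-type arm, the pseudo-regret grows roughly logarithmically in $n$, matching the $\mathcal{O}(\log n)$ instance-dependent rate; the subtlety the paper addresses is that the sample may contain none, which is why its algorithm design differs from classical MAB.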

Cite this Paper


BibTeX
@InProceedings{pmlr-v201-kalvit23a,
  title     = {Complexity Analysis of a Countable-armed Bandit Problem},
  author    = {Kalvit, Anand and Zeevi, Assaf},
  booktitle = {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages     = {850--890},
  year      = {2023},
  editor    = {Agrawal, Shipra and Orabona, Francesco},
  volume    = {201},
  series    = {Proceedings of Machine Learning Research},
  month     = {20 Feb--23 Feb},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v201/kalvit23a/kalvit23a.pdf},
  url       = {https://proceedings.mlr.press/v201/kalvit23a.html},
  abstract  = {We consider a stochastic multi-armed bandit (MAB) problem motivated by “large” action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.}
}
Endnote
%0 Conference Paper
%T Complexity Analysis of a Countable-armed Bandit Problem
%A Anand Kalvit
%A Assaf Zeevi
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona
%F pmlr-v201-kalvit23a
%I PMLR
%P 850--890
%U https://proceedings.mlr.press/v201/kalvit23a.html
%V 201
%X We consider a stochastic multi-armed bandit (MAB) problem motivated by “large” action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.
APA
Kalvit, A. & Zeevi, A. (2023). Complexity Analysis of a Countable-armed Bandit Problem. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:850-890. Available from https://proceedings.mlr.press/v201/kalvit23a.html.