Variance-dependent best arm identification
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:1120-1129, 2021.
Abstract
We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \geq \cdots \geq \theta_n$. We propose an adaptive algorithm that explores the gaps and variances of the arms' rewards and makes sampling decisions based on the gathered information, using a novel approach called grouped median elimination. The proposed algorithm outputs the best arm with probability at least $1-\delta$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)$ samples, where $\Delta_i$ ($i \geq 2$) denotes the reward gap between arm $i$ and the best arm, and we define $\Delta_1 = \Delta_2$. This yields a significant advantage over variance-independent algorithms in favorable scenarios, and it is the first result to remove the extra $\ln n$ factor on the best arm's sample complexity relative to the state of the art. We further show that $\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)$ samples are necessary for any algorithm to achieve the same goal, so our algorithm is optimal up to doubly logarithmic terms.
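As a rough illustration of how a variance-adaptive identification procedure can exploit the $\frac{\sigma_i^2}{\Delta_i^2}$ terms in the bound above, here is a minimal Python sketch. It uses plain successive elimination with empirical Bernstein confidence intervals rather than the paper's grouped median elimination; the `pull` callback, the batch size, and the round schedule are assumptions made purely for illustration.

```python
import math
import random

def empirical_bernstein_radius(values, delta):
    """Confidence radius from the empirical Bernstein inequality
    (Audibert et al., 2009) for i.i.d. samples in [0, 1]."""
    t = len(values)
    mean = sum(values) / t
    var = sum((x - mean) ** 2 for x in values) / t
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / t) + 3.0 * log_term / t

def variance_adaptive_elimination(pull, n, delta, batch=100, max_rounds=60):
    """Successive elimination with variance-adaptive confidence bounds.

    `pull(i)` is an assumed callback returning one reward in [0, 1] for
    arm i. Each round, every surviving arm receives `batch` extra
    samples; an arm is discarded once its upper confidence bound drops
    below the largest lower confidence bound among survivors.
    """
    active = list(range(n))
    samples = {i: [] for i in active}
    for r in range(1, max_rounds + 1):
        # Crude union bound over arms and rounds (sum_r 1/(r(r+1)) = 1).
        delta_r = delta / (n * r * (r + 1))
        for i in active:
            samples[i].extend(pull(i) for _ in range(batch))
        bounds = {}
        for i in active:
            mean = sum(samples[i]) / len(samples[i])
            rad = empirical_bernstein_radius(samples[i], delta_r)
            bounds[i] = (mean - rad, mean + rad)
        best_lcb = max(lcb for lcb, _ in bounds.values())
        active = [i for i in active if bounds[i][1] >= best_lcb]
        if len(active) == 1:
            return active[0]
    # Sampling budget exhausted: return the empirically best survivor.
    return max(active, key=lambda i: sum(samples[i]) / len(samples[i]))

if __name__ == "__main__":
    means = [0.9, 0.8, 0.5, 0.3]  # Bernoulli arms; arm 0 is best
    best = variance_adaptive_elimination(
        lambda i: 1.0 if random.random() < means[i] else 0.0,
        n=len(means), delta=0.05)
    print("identified best arm:", best)
```

The variance-dependent ingredient in this sketch is the empirical Bernstein radius: its leading term scales with the sample variance, so an arm with small $\sigma_i^2$ is eliminated after roughly $\frac{\sigma_i^2}{\Delta_i^2}$ samples instead of the $\frac{1}{\Delta_i^2}$ that Hoeffding-based bounds would require. The sketch does not reproduce the paper's grouping device, which is what removes the extra $\ln n$ factor on the best arm.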