[edit]
Variance-dependent best arm identification
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:1120-1129, 2021.
Abstract
We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of n arms indexed from 1 to n, each arm i is associated with an unknown reward distribution supported on [0,1] with mean θi and variance σ2i. Assume θ1>θ2≥⋯≥θn. We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information using a novel approach called grouped median elimination. The proposed algorithm guarantees to output the best arm with probability (1−δ) and uses at most O(∑ni=1(σ2iΔ2i+1Δi)(lnδ−1+lnlnΔ−1i)) samples, where Δi (i≥2) denotes the reward gap between arm i and the best arm and we define Δ1=Δ2. This achieves a significant advantage over the variance-independent algorithms in some favorable scenarios and is the first result that removes the extra lnn factor on the best arm compared with the state-of-the-art. We further show that Ω(∑ni=1(σ2iΔ2i+1Δi)lnδ−1) samples are necessary for an algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.