# Towards Instance Optimal Bounds for Best Arm Identification

Lijie Chen, Jian Li, Mingda Qiao
Proceedings of the 2017 Conference on Learning Theory, PMLR 65:535-592, 2017.

#### Abstract

In the classical best arm identification (Best-$1$-Arm) problem, we are given $n$ stochastic bandit arms, each associated with a reward distribution with an unknown mean. Upon each play of an arm, we can get a reward sampled i.i.d. from its reward distribution. We would like to identify the arm with the largest mean with probability at least $1-δ$, using as few samples as possible. The problem has a long history and understanding its sample complexity has attracted significant attention since the last decade. However, the optimal sample complexity of the problem is still unknown. Recently, Chen and Li (2016) made an interesting conjecture, called gap-entropy conjecture, concerning the instance optimal sample complexity of Best-$1$-Arm. Given a Best-$1$-Arm instance $I$ (i.e., a set of arms), let $\mu_[i]$ denote the $i$th largest mean and $\Delta_[i]=\mu_[1]-\mu_[i]$ denote the corresponding gap. $H(I)=\sum_i=2^n\Delta_[i]^-2$ denotes the complexity of the instance. The gap-entropy conjecture states that for any instance $I$, $Ω\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)\right)$ is an instance lower bound, where $\mathsf{Ent}(I)$ is an entropy-like term determined by the gaps, and there is a $δ$-correct algorithm for Best-$1$-Arm with sample complexity $O\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)+\Delta_[2]^-2\ln\ln\Delta_[2]^-1\right)$. We note that $Θ\left(\Delta_[2]^-2\ln\ln\Delta_[2]^-1\right)$ is necessary and sufficient to solve the two-arm instance with the best and second best arms. If the conjecture is true, we would have a complete understanding of the instance-wise sample complexity of Best-$1$-Arm (up to constant factors). In this paper, we make significant progress towards a complete resolution of the gap-entropy conjecture. For the upper bound, we provide a highly nontrivial algorithm which requires $O\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)+\Delta_[2]^-2\ln\ln\Delta_[2]^-1\mathrmpolylog(n,δ^-1)\right)$samples in expectation for any instance $I$. For the lower bound, we show that for any Gaussian Best-$1$-Arm instance with gaps of the form $2^-k$, any $δ$-correct monotone algorithm requires at least $Ω\left(H(I)⋅\left(\lnδ^-1 + \mathsf{Ent}(I)\right)\right)$samples in expectation. Here, a monotone algorithm is one which uses no more samples (in expectation) on $I’$ than on $I$, if $I’$ is a sub-instance of $I$ obtained by removing some sub-optimal arms.