On the Power  of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits

Arnab Maiti; Yunbei Xu; Kevin Jamieson

On the Power of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits

Arnab Maiti, Yunbei Xu, Kevin Jamieson

Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4911-4968, 2026.

Abstract

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits, a classical pure-exploration problem. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $⟨\widehat{x},\theta⟩\ge \max_{x\in\mathcal{X}} ⟨x,\theta⟩- \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. Our aim is to better understand the power and limitations of adaptivity in this setting. We begin with non-adaptive algorithms. We present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. Moreover, $w(\mathcal{X})\le \mathcal{O}(d)$ for general $\mathcal{X}$, which is tight for sets such as the unit $\ell_2$ ball, and $w(\mathcal{X})\le \mathcal{O}(\sqrt{d\log|\mathcal{X}|})$ when $\mathcal{X}$ is finite, which is tight for the canonical basis ${e_1,\ldots,e_d}$. We then turn to adaptive sampling. For any finite action set $\mathcal{X}$, we prove the existence of an adaptive algorithm with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{d\log(|\mathcal{X}|/d)}{\varepsilon^2}\right)$ via a generalization of Median Elimination, which is known to yield a $\log d$ improvement for the canonical basis. This raises a structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we show that logarithmic improvements are not the whole story. To our knowledge, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a \emph{polynomial-factor improvement} over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector. Taken together, these results illustrate when adaptivity can offer only modest savings and when it can enable genuine polynomial gains, sharpening our understanding of the role of adaptivity and geometry in pure exploration and experimental design.

Cite this Paper

BibTeX

@InProceedings{pmlr-v336-maiti26a,
  title = 	 {On the Power  of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits},
  author =       {Maiti, Arnab and Xu, Yunbei and Jamieson, Kevin},
  booktitle = 	 {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = 	 {4911--4968},
  year = 	 {2026},
  editor = 	 {Hanneke, Steve and Lattimore, Tor},
  volume = 	 {336},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v336/main/assets/maiti26a/maiti26a.pdf},
  url = 	 {https://proceedings.mlr.press/v336/maiti26a.html},
  abstract = 	 {We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits, a classical pure-exploration problem. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $⟨\widehat{x},\theta⟩\ge \max_{x\in\mathcal{X}} ⟨x,\theta⟩- \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. Our aim is to better understand the power and limitations of adaptivity in this setting. We begin with non-adaptive algorithms. We present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. Moreover, $w(\mathcal{X})\le \mathcal{O}(d)$ for general $\mathcal{X}$, which is tight for sets such as the unit $\ell_2$ ball, and $w(\mathcal{X})\le \mathcal{O}(\sqrt{d\log|\mathcal{X}|})$ when $\mathcal{X}$ is finite, which is tight for the canonical basis ${e_1,\ldots,e_d}$. We then turn to adaptive sampling. For any finite action set $\mathcal{X}$, we prove the existence of an adaptive algorithm with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{d\log(|\mathcal{X}|/d)}{\varepsilon^2}\right)$ via a generalization of Median Elimination, which is known to yield a $\log d$ improvement for the canonical basis. This raises a structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we show that logarithmic improvements are not the whole story. To our knowledge, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a \emph{polynomial-factor improvement} over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector. Taken together, these results illustrate when adaptivity can offer only modest savings and when it can enable genuine polynomial gains, sharpening our understanding of the role of adaptivity and geometry in pure exploration and experimental design.}
}

Endnote

%0 Conference Paper
%T On the Power  of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits
%A Arnab Maiti
%A Yunbei Xu
%A Kevin Jamieson
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore	
%F pmlr-v336-maiti26a
%I PMLR
%P 4911--4968
%U https://proceedings.mlr.press/v336/maiti26a.html
%V 336
%X We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits, a classical pure-exploration problem. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $⟨\widehat{x},\theta⟩\ge \max_{x\in\mathcal{X}} ⟨x,\theta⟩- \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. Our aim is to better understand the power and limitations of adaptivity in this setting. We begin with non-adaptive algorithms. We present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. Moreover, $w(\mathcal{X})\le \mathcal{O}(d)$ for general $\mathcal{X}$, which is tight for sets such as the unit $\ell_2$ ball, and $w(\mathcal{X})\le \mathcal{O}(\sqrt{d\log|\mathcal{X}|})$ when $\mathcal{X}$ is finite, which is tight for the canonical basis ${e_1,\ldots,e_d}$. We then turn to adaptive sampling. For any finite action set $\mathcal{X}$, we prove the existence of an adaptive algorithm with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{d\log(|\mathcal{X}|/d)}{\varepsilon^2}\right)$ via a generalization of Median Elimination, which is known to yield a $\log d$ improvement for the canonical basis. This raises a structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we show that logarithmic improvements are not the whole story. To our knowledge, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a \emph{polynomial-factor improvement} over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector. Taken together, these results illustrate when adaptivity can offer only modest savings and when it can enable genuine polynomial gains, sharpening our understanding of the role of adaptivity and geometry in pure exploration and experimental design.

APA

Maiti, A., Xu, Y. & Jamieson, K.. (2026). On the Power  of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:4911-4968 Available from https://proceedings.mlr.press/v336/maiti26a.html.

Related Material

Download PDF