[edit]
The Sample Complexity of Multiclass and Sparse Contextual Bandits
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:2307-2338, 2026.
Abstract
We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $\mathcal{A}$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by the important special case of bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the underlying reward vector has $L_1$-norm at most $s \ll |\mathcal{A}|$. Our main result is the design of algorithms that, with probability at least $1-\delta$, output an $\varepsilon$-optimal policy compared to policy class $\Pi$ using \begin{align*} \widetilde{O} \left( \left( \frac{s}{\varepsilon^2} + \frac{|\mathcal{A}|}{\varepsilon}\right) \log \frac{|\Pi|}{\delta}\right) \end{align*} samples. We further extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024a,b; Erez and Koren, 2025), which incurred an additional $\Theta(|\mathcal{A}|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.