The Sample Complexity of Multiclass and Sparse Contextual Bandits

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran, Alexander Rakhlin
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:2307-2338, 2026.

Abstract

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $\mathcal{A}$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by the important special case of bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the underlying reward vector has $L_1$-norm at most $s \ll |\mathcal{A}|$. Our main result is the design of algorithms that, with probability at least $1-\delta$, output an $\varepsilon$-optimal policy compared to policy class $\Pi$ using \begin{align*} \widetilde{O} \left( \left( \frac{s}{\varepsilon^2} + \frac{|\mathcal{A}|}{\varepsilon}\right) \log \frac{|\Pi|}{\delta}\right) \end{align*} samples. We further extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024a,b; Erez and Koren, 2025), which incurred an additional $\Theta(|\mathcal{A}|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-erez26a, title = {The Sample Complexity of Multiclass and Sparse Contextual Bandits}, author = {Erez, Liad and Chen, Fan and Cohen, Alon and Koren, Tomer and Mansour, Yishay and Moran, Shay and Rakhlin, Alexander}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {2307--2338}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/erez26a/erez26a.pdf}, url = {https://proceedings.mlr.press/v336/erez26a.html}, abstract = {We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $\mathcal{A}$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by the important special case of bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the underlying reward vector has $L_1$-norm at most $s \ll |\mathcal{A}|$. Our main result is the design of algorithms that, with probability at least $1-\delta$, output an $\varepsilon$-optimal policy compared to policy class $\Pi$ using \begin{align*} \widetilde{O} \left( \left( \frac{s}{\varepsilon^2} + \frac{|\mathcal{A}|}{\varepsilon}\right) \log \frac{|\Pi|}{\delta}\right) \end{align*} samples. We further extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024a,b; Erez and Koren, 2025), which incurred an additional $\Theta(|\mathcal{A}|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.} }
Endnote
%0 Conference Paper %T The Sample Complexity of Multiclass and Sparse Contextual Bandits %A Liad Erez %A Fan Chen %A Alon Cohen %A Tomer Koren %A Yishay Mansour %A Shay Moran %A Alexander Rakhlin %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-erez26a %I PMLR %P 2307--2338 %U https://proceedings.mlr.press/v336/erez26a.html %V 336 %X We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $\mathcal{A}$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by the important special case of bandit multiclass classification with zero-one rewards, we focus on the \emph{$s$-sparse} setting in which, for every context, the underlying reward vector has $L_1$-norm at most $s \ll |\mathcal{A}|$. Our main result is the design of algorithms that, with probability at least $1-\delta$, output an $\varepsilon$-optimal policy compared to policy class $\Pi$ using \begin{align*} \widetilde{O} \left( \left( \frac{s}{\varepsilon^2} + \frac{|\mathcal{A}|}{\varepsilon}\right) \log \frac{|\Pi|}{\delta}\right) \end{align*} samples. We further extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024a,b; Erez and Koren, 2025), which incurred an additional $\Theta(|\mathcal{A}|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
APA
Erez, L., Chen, F., Cohen, A., Koren, T., Mansour, Y., Moran, S. & Rakhlin, A.. (2026). The Sample Complexity of Multiclass and Sparse Contextual Bandits. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:2307-2338 Available from https://proceedings.mlr.press/v336/erez26a.html.

Related Material