For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent

Steve Hanneke, Amirreza Shaeiri, Hongao Wang
Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:539-559, 2025.

Abstract

We study the problem of multiclass online learning under bandit feedback within the framework of universal learning [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21].

In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is uniformly learnable when the effective label space is unbounded; in other words, no online learner guarantees a finite bound on the expected number of mistakes that holds uniformly over all realizable data sequences. In contrast, and surprisingly, we show that for universal learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and under full supervision, in both the realizable and agnostic settings.

More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$, even when the effective label space is unbounded: (1) if $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without a uniform bound on the number of mistakes; (2) if $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized ones, to make a linear expected number of mistakes. Furthermore, our second main contribution reveals a similar trend in the agnostic setting.
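To make the feedback models concrete, here is a minimal, self-contained Python sketch (our illustration, not the paper's algorithm) contrasting the two protocols: under full supervision the learner sees the true label after predicting, while under bandit feedback it learns only whether its prediction was correct. The learner, instances, and labels below are illustrative stand-ins.

```python
"""Minimal illustrative sketch (ours, not the paper's algorithm): the same
memorization-style learner run under bandit feedback vs. full supervision."""

def run(sequence, bandit):
    memory = {}     # bandit: x -> set of labels ruled out; full: x -> true label
    mistakes = 0
    for x, y in sequence:           # adversary reveals x_t; y_t stays hidden
        if bandit:
            ruled_out = memory.setdefault(x, set())
            # Predict the smallest label not yet ruled out for this instance.
            y_hat = next(k for k in range(10**6) if k not in ruled_out)
            # Bandit feedback: the learner observes only 1[y_hat == y].
            if y_hat != y:
                mistakes += 1
                ruled_out.add(y_hat)    # all we learn is "the label is not y_hat"
        else:
            y_hat = memory.get(x, 0)    # default guess before seeing x's label
            if y_hat != y:
                mistakes += 1
            memory[x] = y               # full supervision: y_t itself is revealed
    return mistakes

seq = [("a", 3), ("a", 3), ("b", 1), ("a", 3), ("b", 1)]
print(run(seq, bandit=True))   # 4 mistakes: labels get ruled out one at a time
print(run(seq, bandit=False))  # 2 mistakes: one per new instance
```

The gap widens with the label space: with only a correct/incorrect bit, this naive bandit learner may need as many trials per instance as the label's index, which is why uniform mistake bounds fail when the effective label space is unbounded.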
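For reference, the realizable-setting dichotomy from the abstract can be restated in symbols as follows; the round-by-round notation and the mistake count $M_T$ are ours, not the paper's.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Notation (ours): in round $t$ the adversary reveals $x_t$, the learner
% predicts $\hat{y}_t$, and the mistake count is
% $M_T = \sum_{t=1}^{T} \mathbf{1}[\hat{y}_t \neq y_t]$.
Realizable-setting dichotomy under bandit feedback:
\begin{itemize}
  \item If $\mathcal{C}$ has \emph{no} infinite multiclass Littlestone tree,
        then some deterministic learner guarantees
        $\lim_{T \to \infty} M_T < \infty$ on every realizable sequence,
        though the finite bound may depend on the sequence
        (no uniform bound over sequences).
  \item If $\mathcal{C}$ \emph{has} an infinite multiclass Littlestone tree,
        then some realizable adversary forces
        $\mathbb{E}[M_T] = \Omega(T)$ against every learner,
        randomized or not.
\end{itemize}
\end{document}
```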

Cite this Paper


BibTeX
@InProceedings{pmlr-v272-hanneke25b,
  title     = {For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent},
  author    = {Hanneke, Steve and Shaeiri, Amirreza and Wang, Hongao},
  booktitle = {Proceedings of The 36th International Conference on Algorithmic Learning Theory},
  pages     = {539--559},
  year      = {2025},
  editor    = {Kamath, Gautam and Loh, Po-Ling},
  volume    = {272},
  series    = {Proceedings of Machine Learning Research},
  month     = {24--27 Feb},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v272/main/assets/hanneke25b/hanneke25b.pdf},
  url       = {https://proceedings.mlr.press/v272/hanneke25b.html},
  abstract  = {We study the problem of multiclass online learning under \emph{bandit feedback} within the framework of \emph{universal learning} [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21]. \par In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is \emph{uniformly} learnable when the effective label space is unbounded, or in other words, no online learner guarantees a finite bound on the expected number of mistakes holding uniformly over all realizable data sequences. In contrast, surprisingly, we show that in the case of \emph{universal} learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and full supervision, in both the realizable and agnostic settings. \par More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$ even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without placing a uniform bound on the number of mistakes. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized, to make linear expected number of mistakes. Furthermore, our second main contribution reveals a similar trend in the agnostic setting.}
}
Endnote
%0 Conference Paper
%T For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent
%A Steve Hanneke
%A Amirreza Shaeiri
%A Hongao Wang
%B Proceedings of The 36th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Gautam Kamath
%E Po-Ling Loh
%F pmlr-v272-hanneke25b
%I PMLR
%P 539--559
%U https://proceedings.mlr.press/v272/hanneke25b.html
%V 272
%X We study the problem of multiclass online learning under \emph{bandit feedback} within the framework of \emph{universal learning} [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21]. \par In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is \emph{uniformly} learnable when the effective label space is unbounded, or in other words, no online learner guarantees a finite bound on the expected number of mistakes holding uniformly over all realizable data sequences. In contrast, surprisingly, we show that in the case of \emph{universal} learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and full supervision, in both the realizable and agnostic settings. \par More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$ even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without placing a uniform bound on the number of mistakes. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized, to make linear expected number of mistakes. Furthermore, our second main contribution reveals a similar trend in the agnostic setting.
APA
Hanneke, S., Shaeiri, A., & Wang, H. (2025). For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent. Proceedings of The 36th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 272:539-559. Available from https://proceedings.mlr.press/v272/hanneke25b.html.
