For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent

Steve Hanneke, Amirreza Shaeiri, Hongao Wang
Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:539-559, 2025.

Abstract

We study the problem of multiclass online learning under bandit feedback within the framework of universal learning [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21].

In multiclass online learning under bandit feedback, it is well known that no concept class C is uniformly learnable when the effective label space is unbounded; in other words, no online learner guarantees a finite bound on the expected number of mistakes holding uniformly over all realizable data sequences. In contrast, surprisingly, we show that for universal learnability of concept classes C there is an exact equivalence between multiclass online learnability under bandit feedback and under full supervision, in both the realizable and agnostic settings.

More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class C, even when the effective label space is unbounded, we have: (1) If C does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without a uniform bound on the number of mistakes. (2) If C has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized ones, to make a linear expected number of mistakes. Furthermore, our second main contribution reveals a similar dichotomy in the agnostic setting.
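To make the two feedback models concrete, here is a minimal illustrative sketch, not from the paper: the MemorizingLearner and run_game names are hypothetical, and the toy "smallest label not yet ruled out" learner is only meant to show why uniform mistake bounds fail under bandit feedback with an unbounded label space, while every fixed realizable target still incurs finitely many mistakes.

```python
class MemorizingLearner:
    """Toy learner: remembers confirmed labels and labels it has ruled out.

    Under bandit feedback a mistake only eliminates one candidate label for
    that instance, so an adversary with an unbounded label space can delay
    the correct label arbitrarily long (no uniform mistake bound); under
    full supervision one mistake per instance reveals the true label.
    """
    def __init__(self):
        self.known = {}        # x -> confirmed label
        self.ruled_out = {}    # x -> set of labels known to be wrong

    def predict(self, x):
        if x in self.known:
            return self.known[x]
        bad = self.ruled_out.get(x, set())
        y = 0
        while y in bad:        # guess the smallest label not yet ruled out
            y += 1
        return y

    def update_bandit(self, x, y_hat, correct):
        if correct:
            self.known[x] = y_hat
        else:
            self.ruled_out.setdefault(x, set()).add(y_hat)

    def update_full(self, x, y):
        self.known[x] = y


def run_game(target, instances, bandit=True):
    """Play one realizable sequence against `target`; return mistake count."""
    learner = MemorizingLearner()
    mistakes = 0
    for x in instances:
        y_hat = learner.predict(x)
        y = target(x)          # realizable: labels come from a fixed target
        if y_hat != y:
            mistakes += 1
        if bandit:
            # bandit feedback: only a correct/incorrect bit is revealed
            learner.update_bandit(x, y_hat, correct=(y_hat == y))
        else:
            # full supervision: the true label itself is revealed
            learner.update_full(x, y)
    return mistakes


# A target with label k costs this learner k mistakes on x under bandit
# feedback, so no bound holds uniformly over all targets; yet any fixed
# target yields only finitely many mistakes, the non-uniform (universal)
# guarantee the paper studies.
print(run_game(lambda x: 5, [0] * 7, bandit=True))   # 5 mistakes
print(run_game(lambda x: 5, [0] * 7, bandit=False))  # 1 mistake
```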

Cite this Paper


BibTeX
@InProceedings{pmlr-v272-hanneke25b,
  title = {For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent},
  author = {Hanneke, Steve and Shaeiri, Amirreza and Wang, Hongao},
  booktitle = {Proceedings of The 36th International Conference on Algorithmic Learning Theory},
  pages = {539--559},
  year = {2025},
  editor = {Kamath, Gautam and Loh, Po-Ling},
  volume = {272},
  series = {Proceedings of Machine Learning Research},
  month = {24--27 Feb},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v272/main/assets/hanneke25b/hanneke25b.pdf},
  url = {https://proceedings.mlr.press/v272/hanneke25b.html},
  abstract = {We study the problem of multiclass online learning under \emph{bandit feedback} within the framework of \emph{universal learning} [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21]. \par In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is \emph{uniformly} learnable when the effective label space is unbounded, or in other words, no online learner guarantees a finite bound on the expected number of mistakes holding uniformly over all realizable data sequences. In contrast, surprisingly, we show that in the case of \emph{universal} learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and full supervision, in both the realizable and agnostic settings. \par More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$ even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without placing a uniform bound on the number of mistakes. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized, to make linear expected number of mistakes. Furthermore, our second main contribution reveals a similar trend in the agnostic setting.}
}
Endnote
%0 Conference Paper
%T For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent
%A Steve Hanneke
%A Amirreza Shaeiri
%A Hongao Wang
%B Proceedings of The 36th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2025
%E Gautam Kamath
%E Po-Ling Loh
%F pmlr-v272-hanneke25b
%I PMLR
%P 539--559
%U https://proceedings.mlr.press/v272/hanneke25b.html
%V 272
%X We study the problem of multiclass online learning under \emph{bandit feedback} within the framework of \emph{universal learning} [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21]. \par In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is \emph{uniformly} learnable when the effective label space is unbounded, or in other words, no online learner guarantees a finite bound on the expected number of mistakes holding uniformly over all realizable data sequences. In contrast, surprisingly, we show that in the case of \emph{universal} learnability of concept classes $\mathcal{C}$, there is an exact equivalence between multiclass online learnability under bandit feedback and full supervision, in both the realizable and agnostic settings. \par More specifically, our first main contribution is a theory that establishes an inherent dichotomy in multiclass online learning under bandit feedback within the realizable setting. In particular, for any concept class $\mathcal{C}$ even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without placing a uniform bound on the number of mistakes. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized, to make linear expected number of mistakes. Furthermore, our second main contribution reveals a similar trend in the agnostic setting.
APA
Hanneke, S., Shaeiri, A. & Wang, H. (2025). For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent. Proceedings of The 36th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 272:539-559. Available from https://proceedings.mlr.press/v272/hanneke25b.html.
