For Universal Multiclass Online Learning, Bandit Feedback and Full Supervision are Equivalent
Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:539-559, 2025.
Abstract
We study the problem of multiclass online learning under \emph{bandit feedback} within the framework of \emph{universal learning} [Bousquet, Hanneke, Moran, van Handel, and Yehudayoff; STOC ’21].

In multiclass online learning under bandit feedback, it is well known that no concept class $\mathcal{C}$ is \emph{uniformly} learnable when the effective label space is unbounded: no online learner guarantees a finite bound on the expected number of mistakes that holds uniformly over all realizable data sequences. In contrast, and surprisingly, we show that for \emph{universal} learnability of concept classes $\mathcal{C}$, multiclass online learnability under bandit feedback is exactly equivalent to learnability under full supervision, in both the realizable and the agnostic settings.

More specifically, our first main contribution establishes an inherent dichotomy in multiclass online learning under bandit feedback in the realizable setting. In particular, for any concept class $\mathcal{C}$, even when the effective label space is unbounded, we have: (1) If $\mathcal{C}$ does not have an infinite multiclass Littlestone tree, then there is a deterministic online learner that makes only finitely many mistakes against any realizable adversary, crucially without a uniform bound on the number of mistakes across sequences. (2) If $\mathcal{C}$ has an infinite multiclass Littlestone tree, then there is a strategy for the realizable adversary that forces any learner, including randomized ones, to make an expected number of mistakes that grows linearly in the number of rounds. Furthermore, our second main contribution reveals an analogous dichotomy in the agnostic setting.
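To make the feedback model concrete, the following is a minimal Python sketch of one realizable interaction under bandit feedback, using a toy finite label space and a naive label-elimination learner (all names here, such as EliminationLearner and run_bandit_protocol, are hypothetical illustrations, not constructions from the paper). It shows the defining feature of the setting: the learner observes only whether its prediction was correct, never the true label itself.

# Toy sketch of the multiclass online protocol under bandit feedback.
# Hypothetical names throughout; the paper concerns arbitrary concept
# classes $\mathcal{C}$, possibly with an unbounded effective label space.

class EliminationLearner:
    """Per instance, tracks which labels have not yet been ruled out."""
    def __init__(self, label_space):
        self.label_space = list(label_space)
        self.alive = {}  # instance -> labels still consistent with feedback

    def predict(self, x):
        return self.alive.setdefault(x, list(self.label_space))[0]

    def update(self, x, y_hat, correct):
        if correct:
            self.alive[x] = [y_hat]      # true label identified
        elif y_hat in self.alive[x]:
            self.alive[x].remove(y_hat)  # one bit of feedback rules out one label

def run_bandit_protocol(target, instances, learner):
    """Realizable sequence: labels are produced by a fixed target concept."""
    mistakes = 0
    for x in instances:
        y_hat = learner.predict(x)
        correct = (y_hat == target(x))   # learner never sees target(x) itself
        if not correct:
            mistakes += 1
        learner.update(x, y_hat, correct)
    return mistakes

# With k labels, this learner can make up to k - 1 mistakes per distinct
# instance, so as k grows unboundedly no uniform mistake bound is possible,
# which is the impossibility that motivates the universal-learning view.
learner = EliminationLearner(range(5))
print(run_bandit_protocol(lambda x: x % 5, [0, 1, 0, 1, 0, 0, 2], learner))

The sketch illustrates why a single bit of feedback is so much weaker than full supervision in the uniform regime: each mistake eliminates only one candidate label, whereas a fully supervised learner would learn the true label outright.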