Adversarial Contextual Bandits Go Kernelized
Proceedings of The 35th International Conference on Algorithmic Learning Theory, PMLR 237:907-929, 2024.
Abstract
We study a generalization of the problem of online learning in adversarial linear contextual bandits by incorporating loss functions that belong to a reproducing kernel Hilbert space, which allows for more flexible modeling of complex decision-making scenarios. We propose a computationally efficient algorithm that makes use of a new optimistically biased estimator for the loss functions and achieves near-optimal regret guarantees under a variety of eigenvalue-decay assumptions on the underlying kernel. Specifically, under the assumption of polynomial eigendecay with exponent $c>1$, the regret is $\tilde{O}\big(KT^{\frac{1}{2}(1+\frac{1}{c})}\big)$, where $T$ denotes the number of rounds and $K$ the number of actions. Furthermore, when the eigendecay follows an exponential pattern, we achieve an even tighter regret bound of $\tilde{O}(\sqrt{T})$. These rates match the lower bounds in all special cases where lower bounds are known, and match the best known upper bounds available for the better-studied stochastic counterpart of our problem.
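To make the polynomial-decay rate concrete, the following minimal sketch evaluates the exponent of $T$ in the stated bound $\tilde{O}\big(KT^{\frac{1}{2}(1+\frac{1}{c})}\big)$ for a few values of $c$. The function name and the chosen values of $c$ are illustrative, not from the paper; the sketch only shows that as $c$ grows (faster eigendecay), the exponent approaches the $1/2$ of the exponential-decay $\tilde{O}(\sqrt{T})$ rate.

```python
# Exponent of T in the regret bound O~(K * T^{(1/2)(1 + 1/c)})
# under polynomial eigendecay with exponent c > 1.
def regret_exponent(c: float) -> float:
    """Return the power of T in the polynomial-eigendecay regret bound."""
    assert c > 1, "polynomial eigendecay is assumed with exponent c > 1"
    return 0.5 * (1 + 1 / c)

# Faster eigendecay (larger c) yields an exponent closer to 1/2,
# recovering the sqrt(T)-type rate in the limit.
for c in (1.5, 2.0, 4.0, 100.0):
    print(f"c = {c:>5}: regret grows as T^{regret_exponent(c):.3f}")
```

For example, $c=2$ gives a $\tilde{O}(KT^{3/4})$ bound, while $c=100$ already yields an exponent of about $0.505$.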