[edit]
Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:13814-13844, 2023.
Abstract
We propose Banker Online Mirror Descent (Banker-OMD), a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The Banker-OMD framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus facilitating the design of new algorithms capable of efficiently and robustly handling feedback delays. Specifically, it offers a general methodology for achieving ˜O(√T+√D)-style regret bounds in online bandit learning tasks with delayed feedback, where T is the number of rounds and D is the total feedback delay. We demonstrate the power of Banker-OMD by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. Banker-OMD leads to the first delayed scale-free adversarial MAB algorithm achieving ˜O(√KL(√T+√D)) regret and the first delayed adversarial linear bandit algorithm achieving ˜O(poly(n)(√T+√D)) regret. As a corollary, the first application also implies ˜O(√KTL) regret for non-delayed scale-free adversarial MABs, which is the first to match the Ω(√KTL) lower bound up to logarithmic factors and can be of independent interest.