Delayed Feedback in Generalised Linear Bandits Revisited
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6095-6119, 2023.
Abstract
The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications, where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits from a theoretical perspective. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve a regret of $\tilde{O}(d\sqrt{T} + d^{3/2}\,\mathbb{E}[\tau])$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting, for which the best known regret bound was $\tilde{O}(\sqrt{dT}\,\sqrt{d + \mathbb{E}[\tau]})$. We verify our theoretical results through experiments on simulated data.
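To make the delayed-feedback setting concrete, the sketch below simulates an optimistic (UCB-style) bandit that only updates its estimate once a reward's random delay has elapsed. This is a minimal illustration, not the paper's algorithm: it uses the identity-link (ridge-regression) special case of a GLM, and all names, the geometric delay distribution, and the parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's exact algorithm):
# an optimistic ridge-regression bandit whose estimate is built only from
# rewards whose delay has already elapsed.

rng = np.random.default_rng(0)
d, T, n_arms, lam, beta = 5, 2000, 20, 1.0, 1.0
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
arms = rng.normal(size=(n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

V = lam * np.eye(d)   # regularised design matrix (observed rewards only)
b = np.zeros(d)       # running sum of x_s * r_s over observed rewards
pending = []          # (arrival_time, action, reward) awaiting their delay
regret = 0.0
best_value = (arms @ theta_star).max()

for t in range(T):
    # Incorporate rewards whose delay has elapsed by round t.
    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, x, r in arrived:
        V += np.outer(x, x)
        b += r * x

    # Optimistic (UCB) action selection from the delayed-information estimate.
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    width = np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
    x_t = arms[np.argmax(arms @ theta_hat + beta * width)]

    # The reward is generated now but only observed after a random delay tau_t.
    r_t = x_t @ theta_star + 0.1 * rng.normal()
    tau_t = rng.geometric(0.1)   # assumed delay distribution, E[tau] = 10
    pending.append((t + tau_t, x_t, r_t))
    regret += best_value - x_t @ theta_star

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Under this kind of simulation, the additive effect of the expected delay on cumulative regret can be probed by varying the delay distribution while holding $d$ and $T$ fixed.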