Delayed Feedback in Generalised Linear Bandits Revisited

Benjamin Howson; Ciara Pike-Burke; Sarah Filippi

Delayed Feedback in Generalised Linear Bandits Revisited

Benjamin Howson, Ciara Pike-Burke, Sarah Filippi

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6095-6119, 2023.

Abstract

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of

$\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$ , where

$\mathbb{E}[\tau]$ denotes the expected delay,

$d$ is the dimension and

$T$ is the time horizon. This significantly improves upon existing approaches for this setting where the best known regret bound was

$\widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$ . We verify our theoretical results through experiments on simulated data.

Cite this Paper

BibTeX


@InProceedings{pmlr-v206-howson23b,
  title = 	 {Delayed Feedback in Generalised Linear Bandits Revisited},
  author =       {Howson, Benjamin and Pike-Burke, Ciara and Filippi, Sarah},
  booktitle = 	 {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages = 	 {6095--6119},
  year = 	 {2023},
  editor = 	 {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume = 	 {206},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {25--27 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v206/howson23b/howson23b.pdf},
  url = 	 {https://proceedings.mlr.press/v206/howson23b.html},
  abstract = 	 {The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting where the best known regret bound was $ \widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.}
}

Endnote

%0 Conference Paper
%T Delayed Feedback in Generalised Linear Bandits Revisited
%A Benjamin Howson
%A Ciara Pike-Burke
%A Sarah Filippi
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent	
%F pmlr-v206-howson23b
%I PMLR
%P 6095--6119
%U https://proceedings.mlr.press/v206/howson23b.html
%V 206
%X The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting can achieve regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting where the best known regret bound was $ \widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.

APA


Howson, B., Pike-Burke, C. & Filippi, S.. (2023). Delayed Feedback in Generalised Linear Bandits Revisited. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:6095-6119 Available from https://proceedings.mlr.press/v206/howson23b.html.

Related Material

Download PDF