Delayed Feedback in Generalised Linear Bandits Revisited

Benjamin Howson, Ciara Pike-Burke, Sarah Filippi
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:6095-6119, 2023.

Abstract

The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications, where the reward is almost always delayed. We provide a theoretical study of delayed rewards in generalised linear bandits. We show that a natural adaptation of an optimistic algorithm to the delayed feedback setting achieves regret of $\widetilde{\mathcal{O}}(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau]\,)$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension and $T$ is the time horizon. This significantly improves upon existing approaches for this setting, where the best known regret bound was $\widetilde{\mathcal{O}}(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]}\,)$. We verify our theoretical results through experiments on simulated data.
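The improvement is structural: in the earlier bound the expected delay multiplies $\sqrt{T}$, whereas in the new bound it appears only in an additive $d^{3/2}\mathbb{E}[\tau]$ term, so whenever $\mathbb{E}[\tau] \lesssim \sqrt{T/d}$ the regret matches the $d\sqrt{T}$ rate of the immediate-feedback setting up to constants and logarithmic factors. The algorithmic idea, an optimistic algorithm that builds its estimates from whichever rewards have arrived so far, can be illustrated with a toy simulation. The sketch below is not the paper's algorithm: it swaps the penalised maximum-likelihood estimator used by generalised linear bandit methods for a simple least-squares surrogate, uses an untuned exploration constant, and draws geometric delays; all names, constants, and modelling choices here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Problem setup (dimensions, arms, and delays are illustrative) ---
d, K, T = 5, 20, 2000            # dimension, number of arms, horizon
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
arms = rng.normal(size=(K, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)

def mu(x):                       # logistic link function
    return 1.0 / (1.0 + np.exp(-x))

mean_delay = 50                  # E[tau]; delays drawn geometrically

# --- Optimistic (UCB-style) learner that uses only arrived rewards ---
alpha = 1.0                      # exploration width (tuning constant)
V = np.eye(d)                    # regularised design matrix, observed data only
b = np.zeros(d)                  # running sum of x_s * r_s for observed rewards
pending = []                     # (arrival_round, context, reward) in flight
regret = 0.0

for t in range(1, T + 1):
    # Incorporate rewards whose delay has now elapsed.
    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, x, r in arrived:
        V += np.outer(x, x)
        b += r * x

    # Least-squares estimate on observed data (a stand-in for the
    # penalised MLE of GLM bandit algorithms).
    theta_hat = np.linalg.solve(V, b)
    V_inv = np.linalg.inv(V)

    # Optimistic index: estimated mean plus confidence width.
    widths = np.sqrt(np.einsum("kd,de,ke->k", arms, V_inv, arms))
    ucb = arms @ theta_hat + alpha * widths
    a = int(np.argmax(ucb))

    # Environment: Bernoulli reward through the logistic link, delayed.
    x = arms[a]
    r = float(rng.random() < mu(x @ theta_star))
    tau = rng.geometric(1.0 / mean_delay)   # stochastic delay, mean = mean_delay
    pending.append((t + tau, x, r))

    regret += mu(arms @ theta_star).max() - mu(x @ theta_star)

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Running this sketch with larger values of `mean_delay` delays the point at which the confidence widths shrink, but the eventual slope of the regret curve is unaffected, consistent with the delay entering the bound only as an additive term.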

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-howson23b,
  title     = {Delayed Feedback in Generalised Linear Bandits Revisited},
  author    = {Howson, Benjamin and Pike-Burke, Ciara and Filippi, Sarah},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {6095--6119},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/howson23b/howson23b.pdf},
  url       = {https://proceedings.mlr.press/v206/howson23b.html}
}
Endnote
%0 Conference Paper
%T Delayed Feedback in Generalised Linear Bandits Revisited
%A Benjamin Howson
%A Ciara Pike-Burke
%A Sarah Filippi
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-howson23b
%I PMLR
%P 6095--6119
%U https://proceedings.mlr.press/v206/howson23b.html
%V 206
APA
Howson, B., Pike-Burke, C. & Filippi, S. (2023). Delayed Feedback in Generalised Linear Bandits Revisited. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:6095-6119. Available from https://proceedings.mlr.press/v206/howson23b.html.
