Linear bandits with Stochastic Delayed Feedback

Claire Vernade, Alexandra Carpentier, Tor Lattimore, Giovanni Zappella, Beyza Ermis, Michael Brückner
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:9712-9721, 2020.

Abstract

Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as on-line marketing and recommendation. One of the main challenges faced by practitioners hoping to apply existing algorithms is that usually the feedback is randomly delayed and delays are only partially observable. For example, while a purchase is usually observable some time after the display, the decision of not buying is never explicitly sent to the system. In other words, the learner only observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. We prove optimal O(d\sqrt{T}) bounds on the regret of the first algorithm and study the dependency on delay-dependent parameters. Our model, assumptions and results are validated by experiments on simulated and real data.

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-vernade20a, title = {Linear bandits with Stochastic Delayed Feedback}, author = {Vernade, Claire and Carpentier, Alexandra and Lattimore, Tor and Zappella, Giovanni and Ermis, Beyza and Br{\"u}ckner, Michael}, booktitle = {Proceedings of the 37th International Conference on Machine Learning}, pages = {9712--9721}, year = {2020}, editor = {III, Hal Daumé and Singh, Aarti}, volume = {119}, series = {Proceedings of Machine Learning Research}, month = {13--18 Jul}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v119/vernade20a/vernade20a.pdf}, url = {https://proceedings.mlr.press/v119/vernade20a.html}, abstract = {Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as on-line marketing and recommendation. One of the main challenges faced by practitioners hoping to apply existing algorithms is that usually the feedback is randomly delayed and delays are only partially observable. For example, while a purchase is usually observable some time after the display, the decision of not buying is never explicitly sent to the system. In other words, the learner only observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. We prove optimal O(d\sqrt{T}) bounds on the regret of the first algorithm and study the dependency on delay-dependent parameters. Our model, assumptions and results are validated by experiments on simulated and real data.} }
Endnote
%0 Conference Paper %T Linear bandits with Stochastic Delayed Feedback %A Claire Vernade %A Alexandra Carpentier %A Tor Lattimore %A Giovanni Zappella %A Beyza Ermis %A Michael Brückner %B Proceedings of the 37th International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2020 %E Hal Daumé III %E Aarti Singh %F pmlr-v119-vernade20a %I PMLR %P 9712--9721 %U https://proceedings.mlr.press/v119/vernade20a.html %V 119 %X Stochastic linear bandits are a natural and well-studied model for structured exploration/exploitation problems and are widely used in applications such as on-line marketing and recommendation. One of the main challenges faced by practitioners hoping to apply existing algorithms is that usually the feedback is randomly delayed and delays are only partially observable. For example, while a purchase is usually observable some time after the display, the decision of not buying is never explicitly sent to the system. In other words, the learner only observes delayed positive events. We formalize this problem as a novel stochastic delayed linear bandit and propose OTFLinUCB and OTFLinTS, two computationally efficient algorithms able to integrate new information as it becomes available and to deal with the permanently censored feedback. We prove optimal O(d\sqrt{T}) bounds on the regret of the first algorithm and study the dependency on delay-dependent parameters. Our model, assumptions and results are validated by experiments on simulated and real data.
APA
Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B. & Brückner, M.. (2020). Linear bandits with Stochastic Delayed Feedback. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:9712-9721 Available from https://proceedings.mlr.press/v119/vernade20a.html.

Related Material