Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays

Yifu Tang, Yingfei Wang, Zeyu Zheng
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR 238:3043-3051, 2024.

Abstract

There has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most literature assumes that a previous patient’s survival response to a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is frequently cited as one of the problems in using adaptive designs in clinical trials. More critically, the “delays” in observing the survival response are the same as the rewards rather than being external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated on the same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. With bounded or sub-Gaussian reward distributions, the upper bounds are optimal with a matching lower bound. Our theoretical results and the algorithms’ effectiveness are validated by empirical experiments.
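To make the setting concrete, here is a minimal, hedged sketch of a censored-UCB-style strategy in the reward-dependent-delay regime the abstract describes. This is not the paper's exact algorithm (whose index and analysis are not reproduced on this page); it only illustrates the key mechanic: each pulled arm yields a reward (e.g., a survival time) that is observed only after a delay equal to the reward itself, so pending observations are censored at the current round and the UCB index is computed from arrived samples only. All names and parameters below are illustrative assumptions.

```python
import math
import random

def censored_ucb(horizon, arm_means, seed=0):
    """Illustrative sketch (not the paper's algorithm): UCB over
    exponential survival-time rewards, where each reward arrives
    only after a delay equal to the reward (reward-dependent delay).
    Observations still pending at round t are censored (ignored)."""
    rng = random.Random(seed)
    k = len(arm_means)
    pending = []            # (arrival_round, arm, reward), not yet observed
    counts = [0] * k        # number of arrived (uncensored) samples per arm
    sums = [0.0] * k        # sum of arrived rewards per arm
    pulls = [0] * k         # total pulls per arm

    for t in range(1, horizon + 1):
        # Collect rewards whose delay has elapsed by round t.
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, a, r in arrived:
            counts[a] += 1
            sums[a] += r

        # UCB index on arrived samples; unseen arms get priority.
        def index(a):
            if counts[a] == 0:
                return float("inf")
            mean = sums[a] / counts[a]
            return mean + math.sqrt(2.0 * math.log(t) / counts[a])

        arm = max(range(k), key=index)
        pulls[arm] += 1
        reward = rng.expovariate(1.0 / arm_means[arm])
        # The delay IS the reward: observation arrives at round t + reward.
        pending.append((t + reward, arm, reward))

    return pulls

if __name__ == "__main__":
    # Three treatments with mean survival times 1, 2, 3.
    print(censored_ucb(5000, [1.0, 2.0, 3.0]))
```

Note the bias this setup creates: large rewards stay pending longer, so the empirical mean of arrived samples systematically underestimates the true mean, which is precisely why naive UCB needs the censoring-aware correction the paper develops.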

Cite this Paper


BibTeX
@InProceedings{pmlr-v238-tang24c,
  title     = {Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays},
  author    = {Tang, Yifu and Wang, Yingfei and Zheng, Zeyu},
  booktitle = {Proceedings of The 27th International Conference on Artificial Intelligence and Statistics},
  pages     = {3043--3051},
  year      = {2024},
  editor    = {Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume    = {238},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--04 May},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v238/tang24c/tang24c.pdf},
  url       = {https://proceedings.mlr.press/v238/tang24c.html},
  abstract  = {There has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most literature assumes that a previous patient’s survival response to a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is frequently cited as one of the problems in using adaptive designs in clinical trials. More critically, the “delays” in observing the survival response are the same as the rewards rather than being external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated on the same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. With bounded or sub-Gaussian reward distributions, the upper bounds are optimal with a matching lower bound. Our theoretical results and the algorithms’ effectiveness are validated by empirical experiments.}
}
Endnote
%0 Conference Paper
%T Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays
%A Yifu Tang
%A Yingfei Wang
%A Zeyu Zheng
%B Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2024
%E Sanjoy Dasgupta
%E Stephan Mandt
%E Yingzhen Li
%F pmlr-v238-tang24c
%I PMLR
%P 3043--3051
%U https://proceedings.mlr.press/v238/tang24c.html
%V 238
%X There has been increasing interest in applying multi-armed bandits to adaptive designs in clinical trials. However, most literature assumes that a previous patient’s survival response to a treatment is known before the next patient is treated, which is unrealistic. The inability to account for response delays is frequently cited as one of the problems in using adaptive designs in clinical trials. More critically, the “delays” in observing the survival response are the same as the rewards rather than being external stochastic noise. We formalize this problem as a novel stochastic multi-armed bandit (MAB) problem with reward-dependent delays, where the delay at each round depends on the reward generated on the same round. For general reward/delay distributions with finite expectation, our proposed censored-UCB algorithm achieves near-optimal regret in terms of both problem-dependent and problem-independent bounds. With bounded or sub-Gaussian reward distributions, the upper bounds are optimal with a matching lower bound. Our theoretical results and the algorithms’ effectiveness are validated by empirical experiments.
APA
Tang, Y., Wang, Y. & Zheng, Z. (2024). Stochastic Multi-Armed Bandits with Strongly Reward-Dependent Delays. Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 238:3043-3051. Available from https://proceedings.mlr.press/v238/tang24c.html.