Leveraging (Biased) Information: Multi-armed Bandits with Offline Data

Wang Chi Cheung, Lixing Lyu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:8286-8309, 2024.

Abstract

We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can outperform the UCB policy of Auer et al. (2002), even in the presence of offline data. As a complement, we propose an online policy MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance-independent and instance-dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.
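To make the MIN-UCB idea concrete, below is a minimal Python sketch of the index rule the abstract suggests: compute the usual UCB index from online samples, compute a second upper confidence bound from the offline samples inflated by the assumed bias bound V, and play the arm maximizing the smaller (tighter) of the two. The specific confidence radii, the way V enters, and the Bernoulli simulation harness are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def min_ucb_sketch(mu_off, n_off, V, K, T, rng=None):
    """Illustrative MIN-UCB-style bandit loop (a sketch under assumed
    confidence radii, not the paper's exact algorithm).

    mu_off : offline empirical mean reward of each arm (length-K array)
    n_off  : number of offline samples per arm (length-K array)
    V      : assumed upper bound on |offline mean - online mean|
    K, T   : number of arms / online horizon
    """
    rng = rng or np.random.default_rng(0)
    # Hypothetical "true" online means, used only to simulate rewards.
    mu_true = rng.uniform(0.0, 1.0, K)

    n = np.zeros(K)  # online pull counts
    s = np.zeros(K)  # online reward sums
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize
        else:
            mu_on = s / n
            # Standard UCB index built from online data alone.
            ucb_online = mu_on + np.sqrt(2.0 * np.log(t) / n)
            # Offline-informed upper bound: offline mean, inflated by the
            # bias bound V and an offline-sample confidence radius.
            ucb_offline = mu_off + V + np.sqrt(2.0 * np.log(t) / n_off)
            # MIN-UCB idea: the smaller of two valid upper bounds is still
            # a valid upper bound, so take the elementwise minimum.
            index = np.minimum(ucb_online, ucb_offline)
            a = int(np.argmax(index))
        r = float(rng.random() < mu_true[a])  # simulated Bernoulli reward
        n[a] += 1
        s[a] += r
    return n
```

Under this sketch, a large V (uninformative offline data) makes the offline index loose, the elementwise minimum selects the online index, and the policy reduces to plain UCB; a small V lets the offline data tighten the indices and cut down early exploration, matching the adaptive behavior described in the abstract.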

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-cheung24a,
  title     = {Leveraging ({B}iased) Information: Multi-armed Bandits with Offline Data},
  author    = {Cheung, Wang Chi and Lyu, Lixing},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {8286--8309},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/cheung24a/cheung24a.pdf},
  url       = {https://proceedings.mlr.press/v235/cheung24a.html},
  abstract  = {We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can outperform the UCB policy of Auer et al. (2002), even in the presence of offline data. As a complement, we propose an online policy MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance-independent and instance-dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.}
}
Endnote
%0 Conference Paper
%T Leveraging (Biased) Information: Multi-armed Bandits with Offline Data
%A Wang Chi Cheung
%A Lixing Lyu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-cheung24a
%I PMLR
%P 8286--8309
%U https://proceedings.mlr.press/v235/cheung24a.html
%V 235
%X We leverage offline data to facilitate online learning in stochastic multi-armed bandits. The probability distributions that govern the offline data and the online rewards can be different. Without any non-trivial upper bound on their difference, we show that no non-anticipatory policy can outperform the UCB policy of Auer et al. (2002), even in the presence of offline data. As a complement, we propose an online policy MIN-UCB, which outperforms UCB when a non-trivial upper bound is given. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. MIN-UCB is shown to be tight in terms of both instance-independent and instance-dependent regret bounds. Finally, we corroborate the theoretical results with numerical experiments.
APA
Cheung, W.C. & Lyu, L.. (2024). Leveraging (Biased) Information: Multi-armed Bandits with Offline Data. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:8286-8309 Available from https://proceedings.mlr.press/v235/cheung24a.html.
