Short-lived High-volume Bandits

Su Jia, Nishant Oli, Ian Anderson, Paul Duff, Andrew A Li, R. Ravi
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:14902-14929, 2023.

Abstract

Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can potentially have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, either due to the contents’ transient nature, or some underlying non-stationarity that impels the learner to treat the same item as non-identical copies across time. We consider a multi-play bandit model: in each round, a set of $k=n^\rho$ actions arrives, each available for $w$ rounds and each with mean reward drawn from a fixed known distribution. The learner selects a multiset of $n$ actions at a time. We propose an $\ell$-Layered Sieve Policy that recursively refines the action space $\ell\leq w$ times. We show that for any given $\rho>0$, with a suitable choice of $\ell$, the policy achieves $\tilde O (n^{-\min \{\rho, \frac 12 (1+\frac 1w)^{-1}\}})$ regret. We complement this result with an $\Omega (n^{-\min \{\rho, \frac 12\}})$ lower bound. We further validate the effectiveness of our Sieve Policy via numerical simulations and a field experiment on a large content card serving platform.
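To make the model concrete, below is a minimal, illustrative simulation sketch in Python. It implements the arrival/lifetime dynamics described in the abstract together with a simple one-layer explore-then-exploit rule; it is not the authors' full $\ell$-Layered Sieve Policy, and the parameter values (n, rho, w, the exploration fraction, Bernoulli rewards, and a Uniform(0,1) prior) are assumptions chosen only for illustration.

# Minimal, illustrative simulation of the short-lived high-volume bandit model.
# NOT the authors' l-Layered Sieve Policy: it sketches a simple one-layer
# "sieve" (explore newly arrived arms, exploit the empirically best survivor)
# under assumed parameters.

import numpy as np

rng = np.random.default_rng(0)

n = 200              # pulls (multiset size) per round -- assumed value
rho = 0.5            # arrival exponent: k = n^rho new arms per round
w = 4                # lifetime of each arm, in rounds
T = 50               # number of rounds to simulate
k = int(round(n ** rho))
explore_frac = 0.3   # fraction of the n pulls spent on new arms (assumed tuning)


class Arm:
    def __init__(self, mu, birth):
        self.mu = mu          # true mean reward, drawn from the known prior
        self.birth = birth    # arrival round; the arm expires after w rounds
        self.pulls = 0
        self.reward = 0.0

    @property
    def mean_est(self):
        return self.reward / self.pulls if self.pulls else 0.0


alive = []
total_reward = 0.0
total_best = 0.0   # oracle benchmark: always plays the best currently-alive arm

for t in range(T):
    # Expired arms leave; k = n^rho new arms arrive with means from a Uniform(0,1) prior.
    alive = [a for a in alive if t - a.birth < w]
    alive += [Arm(rng.uniform(), t) for _ in range(k)]

    # Explore: spread a budget of pulls evenly over the newly arrived arms.
    new_arms = [a for a in alive if a.birth == t]
    n_explore = int(explore_frac * n)
    for i in range(n_explore):
        arm = new_arms[i % len(new_arms)]
        r = rng.binomial(1, arm.mu)          # Bernoulli rewards (assumed)
        arm.pulls += 1
        arm.reward += r
        total_reward += r

    # Exploit: put the remaining pulls on the arm with the best empirical mean.
    best_emp = max(alive, key=lambda a: a.mean_est)
    for _ in range(n - n_explore):
        r = rng.binomial(1, best_emp.mu)
        best_emp.pulls += 1
        best_emp.reward += r
        total_reward += r

    total_best += n * max(a.mu for a in alive)

print("per-pull regret:", (total_best - total_reward) / (n * T))

The per-pull regret printed at the end corresponds to the normalized regret notion of the bounds above; with the assumed settings, raising the exploration budget trades estimation error against the opportunity cost of pulling unproven arms, which is exactly the tension the layered policy is designed to balance.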

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-jia23b,
  title     = {Short-lived High-volume Bandits},
  author    = {Jia, Su and Oli, Nishant and Anderson, Ian and Duff, Paul and Li, Andrew A and Ravi, R.},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {14902--14929},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/jia23b/jia23b.pdf},
  url       = {https://proceedings.mlr.press/v202/jia23b.html},
  abstract  = {Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can potentially have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, either due to the contents’ transient nature, or some underlying non-stationarity that impels the learner to treat the same item as non-identical copies across time. We consider a multiplay bandits model. In each round a set of $k=n^\rho$ actions that will be available for $w$ rounds arrives, each of whose mean reward is drawn from a fixed known distribution. The learner selects a multiset of $n$ actions at a time. We propose an $\ell$-Layered Sieve Policy that recursively refines the action space for $\ell\leq w$ times. We show that for any given $\rho>0$, with suitable $\ell$, the policy achieves $\tilde O (n^{-\min \{\rho, \frac 12 (1+\frac 1w)^{-1}\}})$ regret. We also complement this result with an $\Omega (n^{-\min \{\rho, \frac 12\}})$ lower bound. We further validate the effectiveness of our Sieve Policy via numerical simulations and a field experiment in a large content card serving platform.}
}
Endnote
%0 Conference Paper
%T Short-lived High-volume Bandits
%A Su Jia
%A Nishant Oli
%A Ian Anderson
%A Paul Duff
%A Andrew A Li
%A R. Ravi
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-jia23b
%I PMLR
%P 14902--14929
%U https://proceedings.mlr.press/v202/jia23b.html
%V 202
%X Modern platforms leverage randomized experiments to make informed decisions from a given set of alternatives. As a particularly challenging scenario, these alternatives can potentially have (i) high volume, with thousands of new items being released each hour, and (ii) short lifetime, either due to the contents’ transient nature, or some underlying non-stationarity that impels the learner to treat the same item as non-identical copies across time. We consider a multiplay bandits model. In each round a set of $k=n^\rho$ actions that will be available for $w$ rounds arrives, each of whose mean reward is drawn from a fixed known distribution. The learner selects a multiset of $n$ actions at a time. We propose an $\ell$-Layered Sieve Policy that recursively refines the action space for $\ell\leq w$ times. We show that for any given $\rho>0$, with suitable $\ell$, the policy achieves $\tilde O (n^{-\min \{\rho, \frac 12 (1+\frac 1w)^{-1}\}})$ regret. We also complement this result with an $\Omega (n^{-\min \{\rho, \frac 12\}})$ lower bound. We further validate the effectiveness of our Sieve Policy via numerical simulations and a field experiment in a large content card serving platform.
APA
Jia, S., Oli, N., Anderson, I., Duff, P., Li, A.A. & Ravi, R. (2023). Short-lived High-volume Bandits. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:14902-14929. Available from https://proceedings.mlr.press/v202/jia23b.html.
