Tight Regret Bounds for Single-pass Streaming Multi-armed Bandits
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:35525-35547, 2023.
Abstract
Regret minimization in streaming multi-armed bandits (MABs) has been studied extensively, and recent work has shown that algorithms with $o(K)$ memory have to incur $\Omega(T^{2/3})$ regret, where $K$ and $T$ are the numbers of arms and trials, respectively. However, the previous best regret upper bound is still $O(K^{1/3} T^{2/3} \log^{1/3}(T))$, achieved by the simple uniform exploration algorithm. In this paper, we close this gap and complete the picture of regret minimization in single-pass streaming MABs. We first improve the regret lower bound to $\Omega(K^{1/3} T^{2/3})$ for algorithms with $o(K)$ memory. We then show that the $\log^{1/3}(T)$ factor is not necessary by designing algorithms that use at most $O(\log^{*}(K))$-arm memory and achieve $O(K^{1/3} T^{2/3})$ expected regret, based on streaming $\varepsilon$-best arm algorithms. We further test the empirical performance of our algorithms on simulated MAB instances, where the proposed algorithms outperform the benchmark uniform exploration algorithm by a large margin and, on occasion, reduce the regret by up to 70%.
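For intuition, below is a minimal sketch (not the paper's code) of the uniform exploration benchmark the abstract refers to: a single-pass explore-then-commit strategy that pulls each arm roughly $(T/K)^{2/3} \log^{1/3}(T)$ times while keeping only the running best arm in memory, which yields the $O(K^{1/3} T^{2/3} \log^{1/3}(T))$ regret bound. The `BernoulliArm` class and its `pull()` interface are illustrative assumptions, not part of the paper.

```python
import math
import random

class BernoulliArm:
    """Hypothetical arm interface: pull() returns a reward in {0, 1}."""
    def __init__(self, p):
        self.p = p
    def pull(self):
        return 1.0 if random.random() < self.p else 0.0

def uniform_exploration(arms, T):
    """Single-pass explore-then-commit over a stream of K arms (sketch).

    Each arm is pulled m ~ (T/K)^{2/3} * log^{1/3}(T) times; only the
    running best arm is retained, so the algorithm stores O(1) arms.
    The remaining trials are spent on the empirical best arm.
    """
    K = len(arms)
    m = max(1, math.ceil((T / K) ** (2 / 3) * math.log(T) ** (1 / 3)))
    best_arm, best_mean, pulls = None, float("-inf"), 0

    for arm in arms:                      # one pass over the arm stream
        total, n = 0.0, 0
        while n < m and pulls < T:
            total += arm.pull()
            n, pulls = n + 1, pulls + 1
        if n > 0 and total / n > best_mean:
            best_arm, best_mean = arm, total / n

    while pulls < T:                      # commit to the empirical best arm
        best_arm.pull()
        pulls += 1

# usage: uniform_exploration([BernoulliArm(p) for p in (0.3, 0.5, 0.9)], T=10_000)
```

The paper's own algorithms improve on this benchmark by replacing the uniform exploration phase with streaming $\varepsilon$-best arm identification, which removes the $\log^{1/3}(T)$ factor while using at most $O(\log^{*}(K))$-arm memory.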