Adaptive ExplorationExploitation Tradeoff for Opportunistic Bandits
[edit]
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:53065314, 2018.
Abstract
In this paper, we propose and study opportunistic bandits  a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive UpperConfidenceBound (AdaUCB) algorithm to adaptively balance the explorationexploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and realworld traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
Related Material


