Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits

Huasen Wu; Xueying Guo; Xin Liu

Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5306-5314, 2018.

Abstract

In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
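The intuition in the abstract (explore more when the load is low, exploit more when it is high) can be illustrated with a small sketch. This is not the paper's AdaUCB algorithm as specified by the authors. It is a simplified UCB-style index, assuming a load already normalized to [0, 1] and an exploration bonus scaled by (1 - load). The function names, the `alpha` parameter, and the normalization are all hypothetical choices made for illustration.

```python
import math


def load_adaptive_indices(means, counts, t, load, alpha=2.0):
    """Return one UCB-style index per arm.

    Assumption (not from the paper): `load` is normalized to [0, 1] and the
    exploration bonus is multiplied by (1 - load), so the bonus vanishes at
    peak load and equals the plain UCB bonus at zero load.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be normalized to [0, 1]")
    scale = 1.0 - load
    indices = []
    for mean, count in zip(means, counts):
        if count == 0:
            # An arm that was never pulled gets an infinite index,
            # which forces one initial pull.
            indices.append(math.inf)
        else:
            bonus = math.sqrt(alpha * scale * math.log(t) / count)
            indices.append(mean + bonus)
    return indices


def choose_arm(means, counts, t, load, alpha=2.0):
    """Pick the arm with the largest load-adaptive index."""
    indices = load_adaptive_indices(means, counts, t, load, alpha)
    return max(range(len(indices)), key=indices.__getitem__)


# Hypothetical state: arm 0 pulled once, arm 1 pulled 100 times.
means = [0.5, 0.6]
counts = [1, 100]
low_load_choice = choose_arm(means, counts, t=101, load=0.0)
high_load_choice = choose_arm(means, counts, t=101, load=1.0)
```

In this hypothetical state, the rarely pulled arm 0 is selected at zero load because its bonus dominates, while at peak load the bonus is zero and the arm with the higher empirical mean (arm 1) is selected.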

Cite this Paper

BibTeX

@InProceedings{pmlr-v80-wu18b,
  title = 	 {Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits},
  author =       {Wu, Huasen and Guo, Xueying and Liu, Xin},
  booktitle = 	 {Proceedings of the 35th International Conference on Machine Learning},
  pages = 	 {5306--5314},
  year = 	 {2018},
  editor = 	 {Dy, Jennifer and Krause, Andreas},
  volume = 	 {80},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {10--15 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v80/wu18b/wu18b.pdf},
  url = 	 {https://proceedings.mlr.press/v80/wu18b.html},
  abstract = 	 {In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.}
}

Endnote

%0 Conference Paper
%T Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits
%A Huasen Wu
%A Xueying Guo
%A Xin Liu
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-wu18b
%I PMLR
%P 5306--5314
%U https://proceedings.mlr.press/v80/wu18b.html
%V 80
%X In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.

APA

Wu, H., Guo, X. & Liu, X. (2018). Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5306-5314. Available from https://proceedings.mlr.press/v80/wu18b.html.
