Combinatorial Pure Exploration for Dueling Bandit

Wei Chen; Yihan Du; Longbo Huang; Haoyu Zhao

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1531-1541, 2020.

Abstract

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm CAR-Cond with sample complexity analysis. CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.
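The difference between the two winner notions mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes a single position (so the matching structure of CPE-DB is ignored), uses the standard dueling-bandit definitions (Borda score as the mean probability of beating the other candidates, Condorcet winner as a candidate that beats every other with probability above 1/2), and works from a hypothetical, fully known preference matrix `pref`. All function names are illustrative.

```python
import random


def estimate_win_rate(p_win, num_duels, rng):
    """Empirical fraction of duels won when each duel is won with probability p_win."""
    wins = sum(rng.random() < p_win for _ in range(num_duels))
    return wins / num_duels


def borda_scores(pref):
    """Borda score of candidate i: mean of pref[i][j] over all j != i."""
    n = len(pref)
    return [
        sum(pref[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def borda_winner(pref):
    """Index of the candidate with the highest Borda score."""
    scores = borda_scores(pref)
    return max(range(len(scores)), key=lambda i: scores[i])


def condorcet_winner(pref):
    """Index of a candidate that beats every other with probability > 1/2, else None."""
    n = len(pref)
    for i in range(n):
        if all(pref[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None


# Hypothetical preference matrix for three candidates on one position:
# pref[i][j] is the probability that candidate i beats candidate j in a duel.
pref = [
    [0.5, 0.6, 0.6],
    [0.4, 0.5, 0.9],
    [0.4, 0.1, 0.5],
]

if __name__ == "__main__":
    rng = random.Random(0)
    print("Borda scores:", borda_scores(pref))
    print("Borda winner:", borda_winner(pref))
    print("Condorcet winner:", condorcet_winner(pref))
    print("Estimated P(0 beats 1):", estimate_win_rate(pref[0][1], 1000, rng))
```

In this toy matrix candidate 0 narrowly beats both rivals, while candidate 1 loses to candidate 0 but beats candidate 2 by a wide margin, so the two criteria select different candidates. This is one reason the two cases are treated separately.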

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-chen20d,
  title = 	 {Combinatorial Pure Exploration for Dueling Bandit},
  author =       {Chen, Wei and Du, Yihan and Huang, Longbo and Zhao, Haoyu},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {1531--1541},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/chen20d/chen20d.pdf},
  url = 	 {https://proceedings.mlr.press/v119/chen20d.html},
  abstract = 	 {In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm CAR-Cond with sample complexity analysis. CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.}
}

Endnote

%0 Conference Paper
%T Combinatorial Pure Exploration for Dueling Bandit
%A Wei Chen
%A Yihan Du
%A Longbo Huang
%A Haoyu Zhao
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-chen20d
%I PMLR
%P 1531--1541
%U https://proceedings.mlr.press/v119/chen20d.html
%V 119
%X In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm CAR-Cond with sample complexity analysis. CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.

APA

Chen, W., Du, Y., Huang, L. & Zhao, H. (2020). Combinatorial Pure Exploration for Dueling Bandit. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:1531-1541. Available from https://proceedings.mlr.press/v119/chen20d.html.
