Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

Aadirupa Saha; Akshay Krishnamurthy

Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability

Aadirupa Saha, Akshay Krishnamurthy

Proceedings of The 33rd International Conference on Algorithmic Learning Theory, PMLR 167:968-994, 2022.

Abstract

We study the

$K$ -armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes \emph{preference-based feedback} suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class

$\mathcal F$ . We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over

$\mathcal F$ . This resolves an open problem of Dudik et al. (2015) on oracle efficient, regret-optimal algorithms for contextual dueling bandits.

Cite this Paper

BibTeX


@InProceedings{pmlr-v167-saha22a,
  title = 	 {Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability},
  author =       {Saha, Aadirupa and Krishnamurthy, Akshay},
  booktitle = 	 {Proceedings of The 33rd International Conference on Algorithmic Learning Theory},
  pages = 	 {968--994},
  year = 	 {2022},
  editor = 	 {Dasgupta, Sanjoy and Haghtalab, Nika},
  volume = 	 {167},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {29 Mar--01 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v167/saha22a/saha22a.pdf},
  url = 	 {https://proceedings.mlr.press/v167/saha22a.html},
  abstract = 	 {We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes \emph{preference-based feedback} suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class $\mathcal F$. We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over $\mathcal F$. This resolves an open problem of Dudik et al. (2015) on oracle efficient, regret-optimal algorithms for contextual dueling bandits.}
}

Endnote

%0 Conference Paper
%T Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability
%A Aadirupa Saha
%A Akshay Krishnamurthy
%B Proceedings of The 33rd International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Sanjoy Dasgupta
%E Nika Haghtalab	
%F pmlr-v167-saha22a
%I PMLR
%P 968--994
%U https://proceedings.mlr.press/v167/saha22a.html
%V 167
%X We study the $K$-armed contextual dueling bandit problem, a sequential decision making setting in which the learner uses contextual information to make two decisions, but only observes \emph{preference-based feedback} suggesting that one decision was better than the other. We focus on the regret minimization problem under realizability, where the feedback is generated by a pairwise preference matrix that is well-specified by a given function class $\mathcal F$. We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works. The algorithm is also computationally efficient, running in polynomial time assuming access to an online oracle for square loss regression over $\mathcal F$. This resolves an open problem of Dudik et al. (2015) on oracle efficient, regret-optimal algorithms for contextual dueling bandits.

APA


Saha, A. & Krishnamurthy, A.. (2022). Efficient and Optimal Algorithms for Contextual Dueling Bandits under Realizability. Proceedings of The 33rd International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 167:968-994 Available from https://proceedings.mlr.press/v167/saha22a.html.

Related Material

Download PDF