Leveraging Initial Hints for Free in Stochastic Linear Bandits

Ashok Cutkosky, Chris Dann, Abhimanyu Das, Qiuyi Zhang
Proceedings of The 33rd International Conference on Algorithmic Learning Theory, PMLR 167:282-318, 2022.

Abstract

We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint shows provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
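For readers unfamiliar with the baseline setting, the sketch below shows a generic LinUCB-style loop for stochastic linear bandits: play the arm maximizing an optimistic ridge-regression estimate of the reward. This is the vanilla, hint-free algorithm whose $\tilde O(d\sqrt{T})$ worst-case regret the paper matches; it is not the paper's hint-exploiting method, and all constants (confidence radius, regularizer, noise level) are illustrative choices.

```python
import numpy as np

# Generic LinUCB sketch for stochastic linear bandits (NOT the paper's
# hint-using algorithm): each round, play the arm with the highest
# optimistic index, i.e. estimated reward plus a confidence width.
def linucb(arms, theta_star, T=500, beta=1.0, lam=1.0, noise=0.1, seed=0):
    d = len(theta_star)
    rng = np.random.default_rng(seed)
    V = lam * np.eye(d)          # regularized Gram matrix
    b = np.zeros(d)              # sum of reward-weighted played arms
    regret = 0.0
    best = max(x @ theta_star for x in arms)
    for _ in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b    # ridge-regression estimate of theta_star
        # optimistic index: predicted reward + beta * elliptical width
        x = max(arms, key=lambda a: a @ theta_hat
                + beta * np.sqrt(a @ V_inv @ a))
        r = x @ theta_star + noise * rng.standard_normal()  # noisy reward
        V += np.outer(x, x)
        b += r * x
        regret += best - x @ theta_star
    return theta_hat, regret

# Toy instance: optimal arm is [1, 0]; the other arms have reward gaps.
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
theta_hat, regret = linucb(arms, np.array([1.0, 0.0]))
```

The hint studied in the paper can be thought of as extra prior information about `theta_star` (equivalently, a guess at the optimal action) that the learner may trust or discard; the paper's contribution is doing so without degrading the worst-case bound above.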

Cite this Paper


BibTeX
@InProceedings{pmlr-v167-cutkosky22a,
  title     = {Leveraging Initial Hints for Free in Stochastic Linear Bandits},
  author    = {Cutkosky, Ashok and Dann, Chris and Das, Abhimanyu and Zhang, Qiuyi},
  booktitle = {Proceedings of The 33rd International Conference on Algorithmic Learning Theory},
  pages     = {282--318},
  year      = {2022},
  editor    = {Dasgupta, Sanjoy and Haghtalab, Nika},
  volume    = {167},
  series    = {Proceedings of Machine Learning Research},
  month     = {29 Mar--01 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v167/cutkosky22a/cutkosky22a.pdf},
  url       = {https://proceedings.mlr.press/v167/cutkosky22a.html},
  abstract  = {We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint shows provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.}
}
Endnote
%0 Conference Paper
%T Leveraging Initial Hints for Free in Stochastic Linear Bandits
%A Ashok Cutkosky
%A Chris Dann
%A Abhimanyu Das
%A Qiuyi Zhang
%B Proceedings of The 33rd International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Sanjoy Dasgupta
%E Nika Haghtalab
%F pmlr-v167-cutkosky22a
%I PMLR
%P 282--318
%U https://proceedings.mlr.press/v167/cutkosky22a.html
%V 167
%X We study the setting of optimizing with bandit feedback with additional prior knowledge provided to the learner in the form of an initial hint of the optimal action. We present a novel algorithm for stochastic linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate, while maintaining a minimax-optimal $\tilde O(d\sqrt{T})$ regret independent of the quality of the hint. Furthermore, we provide a Pareto frontier of tight tradeoffs between best-case and worst-case regret, with matching lower bounds. Perhaps surprisingly, our work shows that leveraging a hint shows provable gains without sacrificing worst-case performance, implying that our algorithm adapts to the quality of the hint for free. We also provide an extension of our algorithm to the case of $m$ initial hints, showing that we can achieve a $\tilde O(m^{2/3}\sqrt{T})$ regret.
APA
Cutkosky, A., Dann, C., Das, A., &amp; Zhang, Q. (2022). Leveraging Initial Hints for Free in Stochastic Linear Bandits. Proceedings of The 33rd International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 167:282-318. Available from https://proceedings.mlr.press/v167/cutkosky22a.html.