Bandits with costly reward observations

Aaron D. Tucker, Caleb Biddulph, Claire Wang, Thorsten Joachims
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:2147-2156, 2023.

Abstract

Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit’s action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.
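The abstract mentions a non-adaptive algorithm that matches the $\Omega(c^{1/3}T^{2/3})$ lower bound. The sketch below is not the paper's algorithm; it only illustrates one natural non-adaptive strategy for this setting, explore-then-commit with an observation-cost-aware exploration length, and why the $c^{1/3}T^{2/3}$ scaling arises: observing each of K arms m times costs roughly Kmc, the committed phase loses roughly T/sqrt(m), and balancing the two suggests m on the order of (T/(Kc))^{2/3}. The function names (explore_then_commit, pull) and the exact tuning constants are illustrative assumptions.

import numpy as np

def explore_then_commit(pull, K, T, c):
    """Illustrative non-adaptive strategy for bandits with costly reward
    observations (a sketch, not the paper's algorithm): pay the observation
    cost c only during a fixed-length exploration phase, then commit to the
    empirically best arm and never pay to observe again.

    pull(arm) is assumed to return a reward in [0, 1]. The exploration
    length m ~ (T / (K c))^(2/3) balances the ~K*m*c observation cost
    against the ~T/sqrt(m) commitment error, giving O(c^(1/3) T^(2/3))
    regret up to constants and log factors.
    """
    m = max(1, int(np.ceil((T / (K * max(c, 1e-9))) ** (2.0 / 3.0))))
    sums = np.zeros(K)
    net_reward, t = 0.0, 0

    # Exploration phase: pull each arm m times, paying c per observed reward.
    for arm in range(K):
        for _ in range(m):
            if t >= T:
                break
            r = pull(arm)
            sums[arm] += r
            net_reward += r - c
            t += 1

    # Commitment phase: play the empirically best arm without observing rewards.
    best = int(np.argmax(sums / m))
    while t < T:
        net_reward += pull(best)
        t += 1
    return best, net_reward

# Toy usage with Bernoulli arms (means chosen only for illustration).
rng = np.random.default_rng(0)
means = [0.3, 0.5, 0.7]
pull = lambda a: float(rng.random() < means[a])
print(explore_then_commit(pull, K=3, T=10_000, c=0.1))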

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-tucker23a,
  title     = {Bandits with costly reward observations},
  author    = {Tucker, Aaron D. and Biddulph, Caleb and Wang, Claire and Joachims, Thorsten},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {2147--2156},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/tucker23a/tucker23a.pdf},
  url       = {https://proceedings.mlr.press/v216/tucker23a.html},
  abstract  = {Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit's action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.}
}
Endnote
%0 Conference Paper
%T Bandits with costly reward observations
%A Aaron D. Tucker
%A Caleb Biddulph
%A Claire Wang
%A Thorsten Joachims
%B Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2023
%E Robin J. Evans
%E Ilya Shpitser
%F pmlr-v216-tucker23a
%I PMLR
%P 2147--2156
%U https://proceedings.mlr.press/v216/tucker23a.html
%V 216
%X Many machine learning applications rely on large datasets that are conveniently collected from existing sources or that are labeled automatically as a by-product of user actions. However, in settings such as content moderation, accurately and reliably labeled data comes at substantial cost. If a learning algorithm has to pay for reward information, for example by asking a human for feedback, how does this change the exploration/exploitation tradeoff? We study this question in the context of bandit learning. Specifically, we investigate Bandits with Costly Reward Observations, where a cost needs to be paid in order to observe the reward of the bandit's action. We show that the observation cost implies an $\Omega(c^{1/3}T^{2/3})$ lower bound on the regret. Furthermore, we develop a general non-adaptive bandit algorithm which matches this lower bound, and we present several competitive adaptive learning algorithms for both k-armed and contextual bandits.
APA
Tucker, A. D., Biddulph, C., Wang, C., & Joachims, T. (2023). Bandits with costly reward observations. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:2147-2156. Available from https://proceedings.mlr.press/v216/tucker23a.html.
