Complete Policy Regret Bounds for Tallying Bandits
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:5146-5174, 2022.
Abstract
Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $K$ actions and time horizon $T$, we provide an algorithm that w.h.p. achieves a complete policy regret guarantee of $\widetilde{\mathcal{O}}(mK\sqrt{T})$, where the $\widetilde{\mathcal{O}}$ notation hides only logarithmic factors. We additionally prove an $\widetilde{\Omega}(\sqrt{mKT})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near-optimality of our method.
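To make the feedback model concrete, the following is a minimal simulation sketch of the tallying bandit environment described in the abstract. It is not code from the paper: the class name, the convention that the tally includes the current play, the noise model, and the example loss functions are all illustrative assumptions.

```python
import random
from collections import deque

class TallyingBanditEnv:
    """Hypothetical sketch of the tallying bandit feedback model.

    For each action a, an unknown mean-loss function f_a maps the tally of a
    among the last m plays to an expected loss; the learner observes only a
    noisy loss for the action it chose.
    """

    def __init__(self, loss_fns, m, noise=0.05, seed=0):
        self.loss_fns = loss_fns          # K functions: tally -> mean loss
        self.m = m                        # memory of the adversary
        self.history = deque(maxlen=m)    # last m actions played
        self.noise = noise
        self.rng = random.Random(seed)

    def step(self, action):
        # Tally = number of times `action` appears among the last m plays
        # (here counting the current play; one possible convention).
        self.history.append(action)
        tally = sum(1 for a in self.history if a == action)
        mean_loss = self.loss_fns[action](tally)
        return mean_loss + self.rng.uniform(-self.noise, self.noise)

# Example: K = 2 actions whose loss grows with how often they were recently played.
env = TallyingBanditEnv(loss_fns=[lambda t: 0.1 * t, lambda t: 0.3 + 0.02 * t], m=5)
losses = [env.step(random.randrange(2)) for _ in range(20)]
```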