Complete Policy Regret Bounds for Tallying Bandits

Dhruv Malik, Yuanzhi Li, Aarti Singh
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:5146-5174, 2022.

Abstract

Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $\numact$ actions and time horizon $T$, we provide an algorithm that w.h.p achieves a complete policy regret guarantee of $\bigo(m \numact \sqrt{T})$, where the $\bigo$ notation hides only logarithmic factors. We additionally prove an $\bigomega(\sqrt{m \numact T})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near optimality of our method.
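
To make the interaction model concrete, the following is a minimal simulation sketch of a tallying bandit environment, assuming Python and the setup described in the abstract: each action's expected loss at a round depends only on its tally, i.e. the number of times that action was played within the last $m$ rounds. The class name TallyingBanditEnv, the example loss functions, and the alternating policy below are hypothetical illustrations for exposition, not the authors' algorithm or implementation.

import random
from collections import deque

class TallyingBanditEnv:
    """Hypothetical simulator of a tallying bandit with memory length m.

    The expected loss of an action is loss_fns[action](tally), where tally is
    the number of times that action appears in the window of the last m plays
    (here the window includes the current play; this is one natural
    convention, and the paper fixes its own).
    """

    def __init__(self, loss_fns, m, noise=0.05, seed=0):
        self.loss_fns = loss_fns          # one unknown loss function per action
        self.m = m                        # adversary's memory length
        self.history = deque(maxlen=m)    # sliding window of recent actions
        self.noise = noise                # std. dev. of observation noise
        self.rng = random.Random(seed)

    def step(self, action):
        """Play `action` and observe a noisy loss that depends only on its tally."""
        self.history.append(action)
        tally = sum(1 for a in self.history if a == action)
        mean_loss = self.loss_fns[action](tally)
        return mean_loss + self.rng.gauss(0.0, self.noise)

# Tiny usage example: two actions whose mean losses rise or fall with their tally.
if __name__ == "__main__":
    env = TallyingBanditEnv(
        loss_fns=[
            lambda t: 0.1 * t,                   # action 0 degrades with repeated play
            lambda t: max(0.0, 0.5 - 0.1 * t),   # action 1 improves with repeated play
        ],
        m=5,
    )
    total_loss = 0.0
    for round_idx in range(100):
        action = round_idx % 2                   # a naive alternating policy, for illustration
        total_loss += env.step(action)
    print(f"cumulative loss of the alternating policy: {total_loss:.2f}")

Because the losses react to the learner's own recent plays, a good algorithm must reason about sequences of actions rather than single pulls, which is exactly what the complete policy regret benchmark captures.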

Cite this Paper


BibTeX
@InProceedings{pmlr-v178-malik22a,
  title     = {Complete Policy Regret Bounds for Tallying Bandits},
  author    = {Malik, Dhruv and Li, Yuanzhi and Singh, Aarti},
  booktitle = {Proceedings of Thirty Fifth Conference on Learning Theory},
  pages     = {5146--5174},
  year      = {2022},
  editor    = {Loh, Po-Ling and Raginsky, Maxim},
  volume    = {178},
  series    = {Proceedings of Machine Learning Research},
  month     = {02--05 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v178/malik22a/malik22a.pdf},
  url       = {https://proceedings.mlr.press/v178/malik22a.html},
  abstract  = {Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $\numact$ actions and time horizon $T$, we provide an algorithm that w.h.p achieves a complete policy regret guarantee of $\bigo ( m \numact \sqrt{T} )$, where the $\bigo$ notation hides only logarithmic factors. We additionally prove an $\bigomega(\sqrt{ m \numact T})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near optimality of our method.}
}
Endnote
%0 Conference Paper
%T Complete Policy Regret Bounds for Tallying Bandits
%A Dhruv Malik
%A Yuanzhi Li
%A Aarti Singh
%B Proceedings of Thirty Fifth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2022
%E Po-Ling Loh
%E Maxim Raginsky
%F pmlr-v178-malik22a
%I PMLR
%P 5146--5174
%U https://proceedings.mlr.press/v178/malik22a.html
%V 178
%X Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $\numact$ actions and time horizon $T$, we provide an algorithm that w.h.p achieves a complete policy regret guarantee of $\bigo ( m \numact \sqrt{T} )$, where the $\bigo$ notation hides only logarithmic factors. We additionally prove an $\bigomega(\sqrt{ m \numact T})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near optimality of our method.
APA
Malik, D., Li, Y. & Singh, A. (2022). Complete Policy Regret Bounds for Tallying Bandits. Proceedings of Thirty Fifth Conference on Learning Theory, in Proceedings of Machine Learning Research 178:5146-5174. Available from https://proceedings.mlr.press/v178/malik22a.html.