The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation

Zhe Feng, David Parkes, Haifeng Xu
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:3092-3101, 2020.

Abstract

Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandit algorithms under strategic behavior conducted by rational actors, i.e., the arms. Each arm is a self-interested strategic player who can modify its own reward whenever pulled, subject to a cross-period budget constraint, in order to maximize the expected number of times it is pulled. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound of $\mathcal{O}(\max \{ B, K\ln T\})$, where $B$ is the total budget across arms, $K$ is the total number of arms, and $T$ is the running time of the algorithms. This regret guarantee holds for arbitrary adaptive manipulation strategies of the arms. Our second set of main results shows that this regret bound is tight; in fact, for UCB, it is tight even when we restrict the arms’ manipulation strategies to form a Nash equilibrium. We do so by characterizing the Nash equilibrium of the game induced by the arms’ strategic manipulations and showing a regret lower bound of $\Omega(\max \{ B, K\ln T\})$ at the equilibrium.
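To make the setting concrete, the following is a minimal simulation sketch, not the authors' code or experimental setup. It runs a standard UCB1 index rule on Bernoulli arms, where each suboptimal arm inflates its reported reward whenever it is pulled until its budget runs out. The function name `run_ucb`, the greedy "top up to 1.0" manipulation rule, the choice to leave the best arm unmanipulated, and the Bernoulli reward model are all assumptions made for illustration; the paper analyzes arbitrary adaptive strategies and equilibrium behavior, which this sketch does not attempt to reproduce.

```python
import math
import random


def run_ucb(means, budgets, horizon, seed=0):
    """Run UCB1 on Bernoulli arms that may inflate their own reported rewards.

    Illustrative manipulation rule (an assumption, not the paper's equilibrium
    strategy): whenever a suboptimal arm is pulled, it tops its reward up
    towards 1.0 for as long as its budget lasts.
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda i: means[i])
    pulls = [0] * k
    totals = [0.0] * k            # sums of reported (possibly manipulated) rewards
    remaining = list(budgets)     # manipulation budget left for each arm
    regret = 0.0                  # pseudo-regret measured against the true means

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1           # pull every arm once to initialise the indices
        else:
            arm = max(
                range(k),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )

        true_reward = 1.0 if rng.random() < means[arm] else 0.0
        boost = 0.0
        if arm != best:
            # Spend budget to raise the reported reward, never above 1.0.
            boost = min(remaining[arm], 1.0 - true_reward)
            remaining[arm] -= boost

        pulls[arm] += 1
        totals[arm] += true_reward + boost
        regret += means[best] - means[arm]

    spent = [b - r for b, r in zip(budgets, remaining)]
    return pulls, spent, regret


if __name__ == "__main__":
    pulls, spent, regret = run_ucb([0.9, 0.5, 0.4], [0.0, 20.0, 20.0], horizon=2000)
    print("pulls per arm:", pulls)
    print("budget spent per arm:", spent)
    print("pseudo-regret:", round(regret, 2))
```

Running the script with different budget values gives an informal feel for how the regret reacts to the total budget $B$; it is a toy experiment and not evidence for the stated bounds.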

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-feng20c,
  title = {The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation},
  author = {Feng, Zhe and Parkes, David and Xu, Haifeng},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {3092--3101},
  year = {2020},
  editor = {Daumé III, Hal and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/feng20c/feng20c.pdf},
  url = {http://proceedings.mlr.press/v119/feng20c.html},
  abstract = {Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandit algorithms under \emph{strategic behavior} conducted by rational actors, i.e., the arms. Each arm is a \emph{self-interested} strategic player who can modify its own reward whenever pulled, subject to a cross-period budget constraint, in order to maximize the expected number of times it is pulled. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound of $\mathcal{O}(\max \{ B, K\ln T\})$, where $B$ is the total budget across arms, $K$ is the total number of arms, and $T$ is the running time of the algorithms. This regret guarantee holds for \emph{arbitrary adaptive} manipulation strategies of the arms. Our second set of main results shows that this regret bound is \emph{tight}; in fact, for UCB, it is tight even when we restrict the arms’ manipulation strategies to form a \emph{Nash equilibrium}. We do so by characterizing the Nash equilibrium of the game induced by the arms’ strategic manipulations and showing a regret lower bound of $\Omega(\max \{ B, K\ln T\})$ at the equilibrium.}
}
Endnote
%0 Conference Paper
%T The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation
%A Zhe Feng
%A David Parkes
%A Haifeng Xu
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-feng20c
%I PMLR
%P 3092--3101
%U http://proceedings.mlr.press/v119/feng20c.html
%V 119
%X Motivated by economic applications such as recommender systems, we study the behavior of stochastic bandit algorithms under \emph{strategic behavior} conducted by rational actors, i.e., the arms. Each arm is a \emph{self-interested} strategic player who can modify its own reward whenever pulled, subject to a cross-period budget constraint, in order to maximize the expected number of times it is pulled. We analyze the robustness of three popular bandit algorithms: UCB, $\varepsilon$-Greedy, and Thompson Sampling. We prove that all three algorithms achieve a regret upper bound of $\mathcal{O}(\max \{ B, K\ln T\})$, where $B$ is the total budget across arms, $K$ is the total number of arms, and $T$ is the running time of the algorithms. This regret guarantee holds for \emph{arbitrary adaptive} manipulation strategies of the arms. Our second set of main results shows that this regret bound is \emph{tight}; in fact, for UCB, it is tight even when we restrict the arms’ manipulation strategies to form a \emph{Nash equilibrium}. We do so by characterizing the Nash equilibrium of the game induced by the arms’ strategic manipulations and showing a regret lower bound of $\Omega(\max \{ B, K\ln T\})$ at the equilibrium.
APA
Feng, Z., Parkes, D. & Xu, H. (2020). The Intrinsic Robustness of Stochastic Bandits to Strategic Manipulation. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:3092-3101. Available from http://proceedings.mlr.press/v119/feng20c.html.
