Stochastic Bandit Based on Empirical Moments
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:529-537, 2012.
In the multiarmed bandit problem a gambler chooses an arm of a slot machine to pull considering a tradeoff between exploration and exploitation. We study the stochastic bandit problem where each arm has a reward distribution supported in [0,1]. For this model, there exists a policy which achieves the theoretical bound asymptotically. However the optimal policy requires a computation of a convex optimization which involves the empirical distribution of each arm. In this paper, we propose a policy which exploits the first d empirical moments for arbitrary d fixed in advance. We show that the performance of the policy approaches the theoretical bound as d increases. This policy can be implemented by solving polynomial equations, which we derive the explicit solution for d smaller than 5. By choosing appropriate d, the proposed policy realizes a tradeoff between the computational complexity and the expected regret.