Best arm identification in multi-armed bandits with delayed feedback
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, PMLR 84:833-842, 2018.
In this paper, we propose a generalization of the best arm identification problem in stochastic multi-armed bandits (MAB) to the setting where every pull of an arm is associated with delayed feedbacks. The delay in feedbacks increases the effective sample complexity of the algorithm, but can be offset by partial feedbacks received before a pull is completed. We propose a a general modeling framework to structure in the partial feedbacks, and as a special case we introduce efficient algorithms for best arm identification in settings where the partial feedbacks are biased or unbiased estimators of the final outcome of the pull. Additionally, we propose a novel extension of the algorithms to the parallel MAB setting where an agent can control a batch of arms. Experiments on simulated as well as real world datasets of policy search for charging chemical batteries and hyperparameter optimization for mixed integer programming demonstrate that exploiting the structure of partial and delayed feedbacks can lead to significant improvements over baselines on both sequential and parallel MAB.