Efficient Bandit Convex Optimization: Beyond Linear Losses
Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:4008-4067, 2021.
Abstract
We study the problem of online learning with bandit feedback, where a learner aims to minimize a sequence of adversarially generated loss functions, while only observing the value of each function at a single point. When the loss functions chosen by the adversary are convex and quadratic, we develop a new algorithm which achieves the optimal regret rate of $O(T^{1/2})$. Furthermore, our algorithm satisfies three important desiderata: (a) it is practical and can be efficiently implemented for high-dimensional problems, (b) the regret bound holds with high probability even against adaptive adversaries whose decisions can depend on the learner’s previous actions, and (c) it is robust to model mis-specification; that is, the regret bound degrades gracefully when the loss functions deviate from convex quadratics. To the best of our knowledge, ours is the first algorithm for bandit convex optimization with quadratic losses which is efficiently implementable and achieves optimal regret guarantees. Existing algorithms for this problem either have sub-optimal regret guarantees or are computationally expensive and do not scale well to high-dimensional problems. Our algorithm is a bandit version of the classical regularized Newton’s method. It involves estimation of gradients and Hessians of the loss functions from single-point feedback. A key caveat of working with Hessian estimates, however, is that they typically have large variance. In this work, we show that it is nonetheless possible to finesse this issue by relying on the idea of “focus regions”, where we restrict the algorithm to choose actions from a subset of the action space in which the variance of our estimates can be controlled.
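As a concrete illustration of single-point feedback estimation, the sketch below shows the standard spherical-smoothing (one-point) gradient estimator commonly used in bandit convex optimization. It is not the paper's exact estimator: the algorithm described in the abstract additionally builds Hessian estimates from the same single evaluation and restricts play to focus regions to control their variance, neither of which is reproduced here. All function names, the toy quadratic loss, and the step-size and smoothing parameters are illustrative assumptions.

```python
import numpy as np

def single_point_gradient_estimate(f, x, delta, rng=None):
    """One-point gradient estimate via spherical smoothing.

    Queries the loss f at a single perturbed point x + delta * u, where u is
    uniform on the unit sphere, and returns (d / delta) * f(x + delta * u) * u,
    an unbiased estimate of the gradient of the delta-smoothed version of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)          # uniform random direction on the unit sphere
    y = x + delta * u               # the single point the learner actually plays
    value = f(y)                    # the only feedback observed this round
    grad_est = (d / delta) * value * u
    return y, grad_est

# Toy usage: a fixed quadratic loss with a crude projected-gradient update
# (the paper instead uses a regularized Newton-style update with Hessian estimates).
if __name__ == "__main__":
    d = 5
    A = np.eye(d)
    b = np.ones(d)
    loss = lambda z: 0.5 * z @ A @ z + b @ z
    x = np.zeros(d)
    eta, delta = 0.05, 0.1
    for t in range(1000):
        _, g = single_point_gradient_estimate(loss, x, delta)
        x = x - eta * g
        x = x / max(1.0, np.linalg.norm(x))   # project back onto the unit ball
```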