Optimal regret algorithm for Pseudo-1d Bandit Convex Optimization
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:9255-9264, 2021.
Abstract
We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e., $f_t(\mathbf{w}) = \ell_t(g_t(\mathbf{w}))$ where the output of $g_t$ is one-dimensional. At each round, the learner observes a context $\mathbf{x}_t$, plays a prediction $g_t(\mathbf{w}_t; \mathbf{x}_t)$ (e.g., $g_t(\cdot) = \langle \mathbf{x}_t, \cdot \rangle$) for some $\mathbf{w}_t \in \mathbb{R}^d$, and observes the loss $\ell_t(g_t(\mathbf{w}_t))$, where $\ell_t$ is a convex, Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem (PBCO) arises frequently in domains such as online decision-making and parameter tuning in large systems. For this problem, we first show a regret lower bound of $\min\left(\sqrt{dT}, T^{3/4}\right)$ for any algorithm, where $T$ is the number of rounds. We then propose a new algorithm, OPTPBCO, that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound mentioned above, up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}\left(\min\left(d^{9.5}\sqrt{T}, \sqrt{d}\,T^{3/4}\right)\right)$ regret, which is significantly suboptimal in terms of the dimension $d$.
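To make the interaction protocol concrete, the following is a minimal simulation sketch of the pseudo-1d bandit setting with a linear predictor. It is not the algorithm proposed in the paper: the learner here is a plain one-point randomized gradient descent, and the kernelized exponential weights component is omitted. The function name `run_pseudo_1d_protocol`, the absolute-error loss, the hidden vector `w_star`, and the parameters `delta` and `eta` are all assumptions chosen for illustration only.

```python
import numpy as np


def run_pseudo_1d_protocol(T=200, d=5, delta=0.1, eta=0.05, seed=0):
    """Simulate T rounds of pseudo-1d bandit feedback (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Hypothetical environment: a hidden unit vector defines the loss minimizer.
    w_star = rng.normal(size=d)
    w_star /= np.linalg.norm(w_star)

    w = np.zeros(d)  # learner's current iterate
    losses, comparator_losses = [], []

    for _ in range(T):
        # 1. The learner observes a unit-norm context x_t.
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        target = x @ w_star

        # Convex, 1-Lipschitz loss on the scalar prediction (assumed form).
        def loss(y):
            return abs(y - target)

        # 2. The learner perturbs its iterate along x_t only. Because x_t has
        #    unit norm, this shifts the scalar prediction by exactly delta * u,
        #    so exploration happens in one dimension rather than d.
        u = rng.choice([-1.0, 1.0])
        w_played = w + delta * u * x
        y_played = x @ w_played

        # 3. Bandit feedback: only the scalar loss value is revealed.
        observed = loss(y_played)

        # One-point gradient estimate of the smoothed loss with respect to w.
        grad_estimate = (observed / delta) * u * x

        # Gradient step followed by projection onto the unit ball.
        w = w - eta * grad_estimate
        norm = np.linalg.norm(w)
        if norm > 1.0:
            w /= norm

        losses.append(observed)
        comparator_losses.append(loss(x @ w_star))

    losses = np.array(losses)
    regret = float(losses.sum() - np.sum(comparator_losses))
    return regret, losses
```

Calling `run_pseudo_1d_protocol(T=1000, d=10)` returns the cumulative regret against the fixed comparator `w_star` together with the per-round losses. The point of the sketch is the shape of the feedback loop: the learner never sees a gradient or the function $\ell_t$ itself, only one scalar loss per round, and the one-dimensional structure lets it explore along $\mathbf{x}_t$ instead of over all of $\mathbb{R}^d$.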