Bandit Convex Optimization in Nonstationary Environments
Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, PMLR 108:1508-1518, 2020.
Abstract
Bandit Convex Optimization (BCO) is a fundamental framework for modeling sequential decision-making with partial information, where the only feedback available to the player is the one-point or two-point function values. In this paper, we investigate BCO in nonstationary environments and choose the dynamic regret as the performance measure, which is defined as the difference between the cumulative loss incurred by the algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path-length of the comparator sequence that reflects the nonstationarity of environments. We propose a novel algorithm that achieves $O(T^{3/4}(1+P_T)^{1/2})$ and $O(T^{1/2}(1+P_T)^{1/2})$ dynamic regret respectively for the one-point and two-point feedback models. The latter result is optimal, matching the $\Omega(T^{1/2}(1+P_T)^{1/2})$ lower bound established in this paper. Notably, our algorithm is more adaptive to nonstationary environments since it does not require prior knowledge of the path-length $P_T$ ahead of time, which is generally unknown.
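For concreteness, the two quantities in the abstract can be written out as follows; the notation here ($f_t$ for the loss at round $t$, $x_t$ for the algorithm's decision, $u_t$ for the comparator) is standard in this literature rather than taken from the abstract itself:

```latex
% Dynamic regret against a comparator sequence u_1, ..., u_T:
\mathrm{D\text{-}Regret}_T(u_1,\dots,u_T)
  = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t)

% Path-length of the comparator sequence, measuring nonstationarity:
P_T = \sum_{t=2}^{T} \lVert u_t - u_{t-1} \rVert
```

When the comparator is fixed ($u_1 = \dots = u_T$, so $P_T = 0$), dynamic regret reduces to the usual static regret; a larger $P_T$ allows the comparator to track a drifting environment at the cost of a larger regret bound.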