Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits
Proceedings of Thirty Fifth Conference on Learning Theory, PMLR 178:3285-3312, 2022.
Abstract
We introduce a modification of follow the regularised leader and combine it with the log determinant potential and suitable loss estimators to prove that the minimax regret for adaptive adversarial linear bandits is at most $O(d \sqrt{T \log(T)})$ where $d$ is the dimension and $T$ is the number of rounds. By using exponential weights, we improve this bound to $O(\sqrt{dT\log(kT)})$ when the action set has size $k$. These results confirm an old conjecture. We also show that follow the regularised leader with the entropic barrier and suitable loss estimators has regret against an adaptive adversary of at most $O(d^2 \sqrt{T} \log(T))$ and can be implemented in polynomial time, which improves on the best known bound for an efficient algorithm of $O(d^{7/2} \sqrt{T} \, \mathrm{poly}(\log(T)))$ by Lee et al. (2020).
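For concreteness, the following is a minimal sketch of the classical exponential-weights template for a finite-armed adversarial linear bandit with unbiased least-squares loss estimators, the starting point that the paper's $O(\sqrt{dT\log(kT)})$ bound refines. This is not the paper's algorithm: the function name and the parameters `eta` and `gamma` are illustrative choices, the uniform exploration mixture is a simple stand-in for a properly chosen exploration distribution, and this unbiased version is only known to control regret in expectation, whereas the paper's contribution is high-probability bounds against an adaptive adversary via carefully biased estimators.

```python
import numpy as np

def exp_weights_linear_bandit(actions, losses, eta=0.05, gamma=0.05, rng=None):
    """Illustrative exponential-weights sketch for a finite-armed
    adversarial linear bandit (not the algorithm from the paper).

    actions : (k, d) array, finite action set assumed to span R^d
    losses  : (T, d) array, hidden loss vectors chosen by the adversary
    eta     : learning rate (hypothetical default; theory suggests a
              value on the order of sqrt(log(k*T) / (d*T)))
    gamma   : uniform-exploration mixture keeping the covariance invertible
    """
    rng = np.random.default_rng(rng)
    k, d = actions.shape
    log_w = np.zeros(k)                   # log-weights of exponential weights
    total_loss = 0.0
    for ell in losses:
        # Softmax over log-weights, mixed with uniform exploration.
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        p = (1 - gamma) * p + gamma / k
        i = rng.choice(k, p=p)
        a = actions[i]
        y = a @ ell                        # bandit feedback: only <a, ell> is seen
        total_loss += y
        # Covariance Q = E_{a~p}[a a^T]; invertible since actions span R^d
        # and the uniform mixture keeps every action's probability positive.
        Q = actions.T @ (p[:, None] * actions)
        # Unbiased least-squares estimator: ell_hat = Q^{-1} a <a, ell>.
        ell_hat = np.linalg.solve(Q, a) * y
        # Exponential-weights update on the estimated per-action losses.
        log_w -= eta * (actions @ ell_hat)
    return total_loss
```

With a suitable exploration distribution in place of the uniform mixture (e.g., one built from John's ellipsoid of the action set), this template is known to achieve expected regret of order $\sqrt{dT\log k}$; obtaining a comparable guarantee with high probability, rather than in expectation, is precisely where the paper's biased estimators enter.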