Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds

Shinji Ito; Kei Takemura

Proceedings of Thirty Sixth Conference on Learning Theory, PMLR 195:2653-2677, 2023.

Abstract

This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. More precisely, it achieves best-of-three-worlds regret bounds, i.e., of $O(\sqrt{T \log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption. Note that polynomial factors in the dimensionality are omitted here. At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better. The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence. In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$ as well, where $\sigma^2$ denotes the maximum variance of the feedback loss. The proposed algorithm is based on the \texttt{SCRiBLe} algorithm (Abernethy et al., 2012). By incorporating into this a new technique we call \textit{scaled-up sampling}, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.
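For reference, the bounds stated in the abstract can be laid out side by side. The sketch below assumes the standard linear-bandit notation, which the abstract itself does not define: an action set $\mathcal{A} \subset \mathbb{R}^d$, loss vectors $\ell_t$, and chosen actions $a_t$. The regret definition in the first line is the usual one and may differ in detail from the one used in the paper. Polynomial factors in the dimensionality are omitted, as in the abstract.

```latex
% Requires amsmath and amssymb. The notation (\mathcal{A}, \ell_t, a_t, R_T)
% is assumed for illustration and is not taken from the abstract.
\begin{align*}
  R_T &= \max_{a^* \in \mathcal{A}}
         \mathbb{E}\left[ \sum_{t=1}^{T} \langle \ell_t, a_t \rangle
                        - \sum_{t=1}^{T} \langle \ell_t, a^* \rangle \right]
      && \text{(regret, standard definition)} \\
  R_T &= O\left( \sqrt{T \log T} \right)
      && \text{(adversarial)} \\
  R_T &= O\left( \frac{\log T}{\Delta_{\min}}
               + \sqrt{\frac{C \log T}{\Delta_{\min}}} \right)
      && \text{(stochastic with corruption } C\text{)} \\
  R_T &= O\left( \frac{\sigma^2 \log T}{\Delta_{\min}} \right)
      && \text{(stochastic, variance-adaptive)}
\end{align*}
```

Setting $C = 0$ in the corrupted bound leaves $O(\frac{\log T}{\Delta_{\min}})$, the uncorrupted stochastic case. Together with the adversarial and corrupted regimes, this accounts for the three "worlds" in the title.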

Cite this Paper

BibTeX


@InProceedings{pmlr-v195-ito23a,
  title     = {Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds},
  author    = {Ito, Shinji and Takemura, Kei},
  booktitle = {Proceedings of Thirty Sixth Conference on Learning Theory},
  pages     = {2653--2677},
  year      = {2023},
  editor    = {Neu, Gergely and Rosasco, Lorenzo},
  volume    = {195},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--15 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v195/ito23a/ito23a.pdf},
  url       = {https://proceedings.mlr.press/v195/ito23a.html},
  abstract = 	 {This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.  At the higher level, the proposed algorithm adapts to a variety of types of environments.  More precisely, it achieves best-of-three-worlds regret bounds, i.e., of ${O}(\sqrt{T \log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption.  Note that polynomial factors in the dimensionality are omitted here.  At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better.  The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence.  In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$ as well, where $\sigma^2$ denotes the maximum variance of the feedback loss.  The proposed algorithm is based on the \texttt{SCRiBLe} algorithm (Abernethy et al., 2012).  By incorporating into this a new technique we call \textit{scaled-up sampling}, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.}
}
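
As a minimal usage sketch, the entry above could be cited from a LaTeX document as follows. The file name `refs.bib` and the `plain` bibliography style are assumptions made for illustration; only the citation key `pmlr-v195-ito23a` comes from the entry itself.

```latex
% Assumes the BibTeX entry above has been saved to refs.bib (hypothetical file name).
\documentclass{article}
\begin{document}

Best-of-three-worlds regret bounds for linear bandits
are given by Ito and Takemura~\cite{pmlr-v195-ito23a}.

\bibliographystyle{plain}  % any standard style works; plain is an arbitrary choice
\bibliography{refs}        % refs.bib, without the extension

\end{document}
```

Compiling with `pdflatex`, then `bibtex`, then `pdflatex` twice resolves the citation.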

Endnote

%0 Conference Paper
%T Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds
%A Shinji Ito
%A Kei Takemura
%B Proceedings of Thirty Sixth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Gergely Neu
%E Lorenzo Rosasco
%F pmlr-v195-ito23a
%I PMLR
%P 2653--2677
%U https://proceedings.mlr.press/v195/ito23a.html
%V 195
%X This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy.  At the higher level, the proposed algorithm adapts to a variety of types of environments.  More precisely, it achieves best-of-three-worlds regret bounds, i.e., of ${O}(\sqrt{T \log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption.  Note that polynomial factors in the dimensionality are omitted here.  At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better.  The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence.  In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$ as well, where $\sigma^2$ denotes the maximum variance of the feedback loss.  The proposed algorithm is based on the \texttt{SCRiBLe} algorithm (Abernethy et al., 2012).  By incorporating into this a new technique we call \textit{scaled-up sampling}, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.

APA


Ito, S. & Takemura, K. (2023). Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds. Proceedings of Thirty Sixth Conference on Learning Theory, in Proceedings of Machine Learning Research 195:2653-2677. Available from https://proceedings.mlr.press/v195/ito23a.html.
