Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs

Haipeng Luo; Hanghang Tong; Mengxiao Zhang; Yuheng Zhang

Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs

Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang

Proceedings of The 34th International Conference on Algorithmic Learning Theory, PMLR 201:1074-1100, 2023.

Abstract

We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{\frac{1}{2}}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result (Neu, 15) which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of (Alon et al., 2015) by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.

Cite this Paper

BibTeX


@InProceedings{pmlr-v201-luo23a,
  title = 	 {Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs},
  author =       {Luo, Haipeng and Tong, Hanghang and Zhang, Mengxiao and Zhang, Yuheng},
  booktitle = 	 {Proceedings of The 34th International Conference on Algorithmic Learning Theory},
  pages = 	 {1074--1100},
  year = 	 {2023},
  editor = 	 {Agrawal, Shipra and Orabona, Francesco},
  volume = 	 {201},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {20 Feb--23 Feb},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v201/luo23a/luo23a.pdf},
  url = 	 {https://proceedings.mlr.press/v201/luo23a.html},
  abstract = 	 {We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{\frac{1}{2}}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result (Neu, 15) which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of (Alon et al., 2015) by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.}
}

Endnote

%0 Conference Paper
%T Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs
%A Haipeng Luo
%A Hanghang Tong
%A Mengxiao Zhang
%A Yuheng Zhang
%B Proceedings of The 34th International Conference on Algorithmic Learning Theory
%C Proceedings of Machine Learning Research
%D 2023
%E Shipra Agrawal
%E Francesco Orabona	
%F pmlr-v201-luo23a
%I PMLR
%P 1074--1100
%U https://proceedings.mlr.press/v201/luo23a.html
%V 201
%X We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{\frac{1}{2}}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result (Neu, 15) which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of (Alon et al., 2015) by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.

APA


Luo, H., Tong, H., Zhang, M. & Zhang, Y.. (2023). Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. Proceedings of The 34th International Conference on Algorithmic Learning Theory, in Proceedings of Machine Learning Research 201:1074-1100 Available from https://proceedings.mlr.press/v201/luo23a.html.

Related Material

Download PDF