CVaR-Regret Bounds for Multi-armed Bandits

Chenmien Tan, Paul Weng
Proceedings of The 14th Asian Conference on Machine Learning, PMLR 189:974-989, 2023.

Abstract

In contrast to risk-averse multi-armed bandits (MABs), where one seeks the best arm under a risk-sensitive criterion but evaluates the algorithm itself in a risk-neutral way, in this paper we seek the best arm with respect to the mean, as in the standard MAB, while adopting a risk-averse attitude toward the algorithm's performance. The conditional value-at-risk (CVaR) of the regret is adopted as the performance metric, which extends the traditional expected-regret minimization framework. For this new problem, we revisit several classic algorithms for stochastic and non-stochastic bandits (UCB, MOSS, and Exp3-IX with its variants) and propose parameter choices with theoretically guaranteed CVaR-regret bounds that match the known expected-regret results and are (nearly) optimal up to constant factors. In the non-stochastic setting, we show that implicit exploration trades off the variability of the regret against the regret in expectation. Numerical experiments validate our results.
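For reference, the CVaR of the regret used above is the standard Rockafellar-Uryasev tail risk measure applied to the random regret R_n. A worked statement, under the assumption that alpha in (0, 1] denotes the tail level (the paper's exact sign and tail conventions may differ):

\[
\mathrm{CVaR}_\alpha(R_n) \;=\; \inf_{z \in \mathbb{R}} \Big\{ z + \tfrac{1}{\alpha}\, \mathbb{E}\big[(R_n - z)^+\big] \Big\}, \qquad (x)^+ = \max\{x, 0\}.
\]

Taking \(\alpha = 1\) gives \(\mathrm{CVaR}_1(R_n) = \mathbb{E}[R_n]\), so a CVaR-regret bound strictly generalizes an expected-regret bound, while smaller \(\alpha\) puts more weight on the upper tail of the regret distribution.

To make the implicit-exploration trade-off concrete, the following is a minimal Python sketch of Exp3-IX in the style of Neu (2015); the function and parameter names are illustrative assumptions, not the paper's code, and the tuned parameter choices analyzed in the paper are not reproduced here.

import numpy as np

def exp3_ix(loss_fn, n_arms, horizon, eta, gamma, seed=0):
    """Sketch of Exp3-IX: exponential weights with implicit exploration.

    loss_fn(t, arm) is assumed to return a loss in [0, 1]; eta is the
    learning rate and gamma >= 0 is the implicit-exploration parameter.
    """
    rng = np.random.default_rng(seed)
    cum_loss_est = np.zeros(n_arms)  # cumulative importance-weighted loss estimates
    for t in range(horizon):
        # Exponential-weights distribution (shifted by the min for numerical stability).
        weights = np.exp(-eta * (cum_loss_est - cum_loss_est.min()))
        probs = weights / weights.sum()
        arm = rng.choice(n_arms, p=probs)
        loss = loss_fn(t, arm)
        # IX estimator: dividing by probs[arm] + gamma instead of probs[arm]
        # biases the estimate downward but caps its variance, which is what
        # controls the upper tail (and hence the CVaR) of the regret.
        cum_loss_est[arm] += loss / (probs[arm] + gamma)
    return probs

With gamma = 0 this reduces to vanilla Exp3, whose regret has a heavier upper tail; increasing gamma lowers the variability of the regret at the cost of extra bias in expectation, which is the trade-off the abstract refers to.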

Cite this Paper


BibTeX
@InProceedings{pmlr-v189-tan23a,
  title     = {CVaR-Regret Bounds for Multi-armed Bandits},
  author    = {Tan, Chenmien and Weng, Paul},
  booktitle = {Proceedings of The 14th Asian Conference on Machine Learning},
  pages     = {974--989},
  year      = {2023},
  editor    = {Khan, Emtiyaz and Gonen, Mehmet},
  volume    = {189},
  series    = {Proceedings of Machine Learning Research},
  month     = {12--14 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v189/tan23a/tan23a.pdf},
  url       = {https://proceedings.mlr.press/v189/tan23a.html}
}
Endnote
%0 Conference Paper
%T CVaR-Regret Bounds for Multi-armed Bandits
%A Chenmien Tan
%A Paul Weng
%B Proceedings of The 14th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Emtiyaz Khan
%E Mehmet Gonen
%F pmlr-v189-tan23a
%I PMLR
%P 974--989
%U https://proceedings.mlr.press/v189/tan23a.html
%V 189
APA
Tan, C. & Weng, P. (2023). CVaR-Regret Bounds for Multi-armed Bandits. Proceedings of The 14th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 189:974-989. Available from https://proceedings.mlr.press/v189/tan23a.html.