CVaR-Regret Bounds for Multi-armed Bandits
Proceedings of The 14th Asian Conference on Machine
Learning, PMLR 189:974-989, 2023.
Abstract
In contrast to risk-averse multi-armed bandits (MABs), where one seeks the best arm under a risk-sensitive criterion while remaining risk-neutral about the performance of the algorithm itself, in this paper we seek the best arm with respect to the mean, as in the standard MAB, but adopt a risk-averse attitude toward the performance of the algorithm. We adopt the conditional value-at-risk (CVaR) of the regret as the metric for evaluating algorithms, which extends the traditional framework of expected-regret minimization. For this new problem, we revisit several classic algorithms for stochastic and non-stochastic bandits, namely UCB, MOSS, and Exp3-IX with its variants, and propose parameter choices with theoretically guaranteed CVaR-regret bounds that match the corresponding expected-regret results and are (nearly) optimal up to constant factors. In the non-stochastic setting, we show that implicit exploration achieves a trade-off between the variability of the regret and its expectation. Numerical experiments are conducted to validate our results.
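
For reference, the standard (Rockafellar-Uryasev) definition of CVaR at level $\alpha \in (0, 1]$, applied to the regret, is the following; the paper's exact tail convention may differ:

\[
\mathrm{CVaR}_{\alpha}(R_T) \;=\; \inf_{t \in \mathbb{R}} \left\{ t + \frac{1}{\alpha}\, \mathbb{E}\big[(R_T - t)_+\big] \right\},
\]

where $R_T$ is the regret after $T$ rounds and $(x)_+ = \max(x, 0)$. For continuous distributions this equals the expected regret conditional on the regret exceeding its $(1-\alpha)$-quantile, so $\alpha = 1$ recovers the expected regret and smaller $\alpha$ places more weight on the worst-case tail.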
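As a minimal illustration of evaluating CVaR-regret empirically, the sketch below runs UCB1 on a hypothetical Bernoulli instance and reports the empirical CVaR of the realized pseudo-regret over repeated runs. The instance, horizon, and helper names are illustrative assumptions, not the paper's experimental setup.

import numpy as np

def empirical_cvar(samples, alpha):
    # Mean of the worst (largest) ceil(alpha * n) regret samples.
    samples = np.sort(np.asarray(samples))[::-1]
    k = max(1, int(np.ceil(alpha * len(samples))))
    return samples[:k].mean()

def run_ucb1(means, horizon, rng):
    # One run of UCB1 on Bernoulli arms; returns the realized pseudo-regret.
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # pseudo-regret increment
    return regret

rng = np.random.default_rng(0)
means = [0.5, 0.45, 0.4]  # hypothetical Bernoulli arm means
regrets = [run_ucb1(means, horizon=2000, rng=rng) for _ in range(200)]
print("mean regret :", np.mean(regrets))
print("CVaR_0.05   :", empirical_cvar(regrets, alpha=0.05))

Comparing the mean regret with the CVaR at a small $\alpha$ (here 0.05) makes the distinction concrete: two algorithms with similar expected regret can differ substantially in the tail that CVaR-regret penalizes.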