Low-rank Matrix Bandits with Heavy-tailed Rewards
Proceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence, PMLR 244:1863-1889, 2024.
Abstract
In the stochastic low-rank matrix bandit problem, the expected reward of an arm equals the inner product between its feature matrix and an unknown $d_1 \times d_2$ low-rank parameter matrix $\Theta^*$ with rank $r \ll d_1 \wedge d_2$. While all prior studies assume the payoffs are perturbed by sub-Gaussian noise, in this work we relax this strict assumption and consider the new problem of low-rank matrix bandits with heavy-tailed rewards (LowHTR), where the rewards only have a finite $(1+\delta)$-th moment for some $\delta \in (0,1]$. By utilizing truncation of observed payoffs and dynamic exploration, we propose a novel algorithm called LOTUS attaining a regret bound of order $\tilde{O}(d^{3/2} r^{1/2} T^{\frac{1}{1+\delta}} / \tilde{D}_{rr})$ without knowing $T$, which matches the state-of-the-art regret bound under sub-Gaussian noise \citep{lu2021low,kang2022efficient} when $\delta = 1$. Moreover, we establish a lower bound of order $\Omega(d^{\frac{\delta}{1+\delta}} r^{\frac{\delta}{1+\delta}} T^{\frac{1}{1+\delta}}) = \Omega(T^{\frac{1}{1+\delta}})$ for LowHTR, which indicates our LOTUS is nearly optimal in the order of $T$. In addition, we improve LOTUS so that it does not require knowledge of the rank $r$, with an $\tilde{O}(d r^{3/2} T^{\frac{1+\delta}{1+2\delta}})$ regret bound, and it remains efficient in the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.
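The truncation idea mentioned above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function names and the threshold schedule (growing as $t^{1/(1+\delta)}$, a common choice in heavy-tailed bandit analyses) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def truncate(y, b):
    # Keep a payoff only if its magnitude is within the threshold b;
    # otherwise replace it with 0, discarding extreme heavy-tailed draws
    # so that the truncated reward has bounded magnitude.
    y = np.asarray(y, dtype=float)
    return y * (np.abs(y) <= b)

def threshold(t, delta):
    # Illustrative threshold schedule growing as t^{1/(1+delta)}:
    # with delta = 1 (finite variance) it grows as sqrt(t), so fewer
    # observations are discarded as more data accumulates.
    return t ** (1.0 / (1.0 + delta))
```

For example, with threshold `b = 2.0`, a payoff of `0.5` is kept while an extreme draw of `10.0` is zeroed out before being fed to the estimator.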