A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:5411-5437, 2023.
Abstract
We study regret for risk-sensitive reinforcement learning (RL) with the exponential utility in episodic MDPs. Recent works establish both a lower bound $\Omega\big((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|\big)$ and the best known upper bound $\tilde{O}\big((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|\big)$, where $H$ is the length of the episode, $S$ the size of the state space, $A$ the size of the action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and lower bounds is exponential and hence unsatisfactory. In this paper, we show that a variant of the UCB-Advantage algorithm removes a factor of $\sqrt{H}$ from the best previously known bound in any arbitrary MDP. To further sharpen the regret bound, we introduce a new mechanism for regret analysis and derive a problem-dependent regret bound for the algorithm that requires no prior knowledge of the MDP. This bound is much tighter in MDPs with special structures. In particular, we show that a regret matching the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, improving on the best previously known bound by an exponential factor. Furthermore, we derive a novel information-theoretic lower bound in terms of the problem-dependent statistic $\max_{h\in[H]} c_{v,h+1}^*$. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c_{v,h+1}^*$.
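To see concretely why the gap between the two bounds above is exponential, a rough comparison (assuming $|\beta|H$ is not too small, so the $-1$ terms are negligible) gives
\[
\frac{(e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|}{(e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|}
\;\approx\; H\, e^{|\beta|(H+1)/2},
\]
i.e., the ratio of the previously best upper bound to the lower bound grows exponentially in the horizon $H$.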