[edit]

# A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning

*Proceedings of The 26th International Conference on Artificial Intelligence and Statistics*, PMLR 206:5411-5437, 2023.

#### Abstract

We study the regret for risk-sensitive reinforcement learning (RL) with the exponential utility in the episodic MDP. Recent works establish both a lower bound $\Omega((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|)$ and the best known (upper) bound $\tilde{O}((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|)$, where $H$ is the length of the episode, $S$ the size of state space, $A$ the size of action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and the lower bound is exponential and hence is unsatisfactory. In this paper, we show that a variant of UCB-Advantage algorithm reduces a factor of $\sqrt{H}$ from the best previously known bound in any arbitrary MDP. To further sharpen the regret bound, we introduce a brand new mechanism of regret analysis and derive a problem-dependent regret bound without prior knowledge of the MDP from the algorithm. This bound is much tighter in MDPs with special structures. Particularly, we show that a regret that matches the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, which improves an exponential factor over the best previously known bound. Further, we derive a novel information-theoretic lower bound of $\Omega(\max_{h\in[H]} c_{v,h+1}^*\sqrt{SAT}/|\beta|)$, where $\max_{h\in[H]} c_{v,h+1}^*$ is a problem-dependent statistic. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c_{v,h+1}^*$.