A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:5411-5437, 2023.
Abstract
We study the regret of risk-sensitive reinforcement learning (RL) with the exponential utility in episodic MDPs. Recent works establish both a lower bound $\Omega((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|)$ and the best known upper bound $\tilde{O}((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|)$, where $H$ is the length of an episode, $S$ the size of the state space, $A$ the size of the action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and lower bounds is exponential and hence unsatisfactory. In this paper, we show that a variant of the UCB-Advantage algorithm reduces the best previously known bound by a factor of $\sqrt{H}$ in any MDP. To further sharpen the regret bound, we introduce a new regret-analysis mechanism and derive from the algorithm a problem-dependent regret bound that requires no prior knowledge of the MDP. This bound is much tighter in MDPs with special structure. In particular, we show that, within a rich class of MDPs, the algorithm attains a regret matching the information-theoretic lower bound up to logarithmic factors, improving on the best previously known bound by an exponential factor. Furthermore, we derive a novel information-theoretic lower bound of $\Omega(\max_{h\in[H]} c_{v,h+1}^*\sqrt{SAT}/|\beta|)$, where $\max_{h\in[H]} c_{v,h+1}^*$ is a problem-dependent statistic. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c_{v,h+1}^*$.
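For context, the following is a minimal sketch of the exponential-utility (entropic risk) objective that is standard in this line of work and of the regret notion being bounded; the episode count $K$ (with $T = KH$) and the per-episode policies $\pi_k$ are notational assumptions not spelled out in the abstract, and the paper's own definitions may differ in detail:

$$V_1^{\pi}(s) \;=\; \frac{1}{\beta}\,\log\,\mathbb{E}_{\pi}\!\left[\exp\!\Big(\beta \sum_{h=1}^{H} r_h(s_h,a_h)\Big)\,\Big|\, s_1 = s\right], \qquad \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k})\Big),$$

where $V_1^{*}$ denotes the optimal value and $s_1^{k}$ the initial state of episode $k$. Taking $\beta \to 0$ recovers the standard risk-neutral expected return, while $\beta > 0$ (resp. $\beta < 0$) yields risk-seeking (resp. risk-averse) behavior.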