A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning

Xiaoyan Hu, Ho-Fung Leung
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:5411-5437, 2023.

Abstract

We study regret for risk-sensitive reinforcement learning (RL) with exponential utility in episodic MDPs. Recent works establish both a lower bound $\Omega((e^{|\beta|(H-1)/2}-1)\sqrt{SAT}/|\beta|)$ and the best known upper bound $\tilde{O}((e^{|\beta|H}-1)\sqrt{H^2SAT}/|\beta|)$, where $H$ is the length of the episode, $S$ the size of the state space, $A$ the size of the action space, $T$ the total number of timesteps, and $\beta$ the risk parameter. The gap between the upper and lower bounds is exponential and hence unsatisfactory. In this paper, we show that a variant of the UCB-Advantage algorithm improves the best previously known bound by a factor of $\sqrt{H}$ in arbitrary MDPs. To further sharpen the regret bound, we introduce a new regret-analysis mechanism and derive a problem-dependent regret bound for the algorithm that requires no prior knowledge of the MDP. This bound is much tighter in MDPs with special structure. In particular, we show that a regret matching the information-theoretic lower bound up to logarithmic factors can be attained within a rich class of MDPs, improving on the best previously known bound by an exponential factor. Furthermore, we derive a novel information-theoretic lower bound of $\Omega(\max_{h\in[H]} c_{v,h+1}^*\sqrt{SAT}/|\beta|)$, where $\max_{h\in[H]} c_{v,h+1}^*$ is a problem-dependent statistic. This lower bound shows that the problem-dependent regret bound achieved by the algorithm is optimal in its dependence on $\max_{h\in[H]} c_{v,h+1}^*$.
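
For reference, the objective and regret that these bounds refer to can be sketched as follows. This is a minimal sketch assuming the standard exponential-utility (entropic-risk) formulation used in this line of work; the symbols $V^{\pi}$, $\pi^{k}$, $s_1^{k}$, and $K$ are illustrative notation, not taken verbatim from the paper.

% Sketch: exponential-utility value of a policy \pi in an episodic MDP with
% horizon H and risk parameter \beta \neq 0, and the cumulative regret over
% K episodes (so T = KH). Assumed standard setup, not the paper's exact notation.
\begin{align*}
  V^{\pi}(s_1) &= \frac{1}{\beta}\,
    \log \mathbb{E}^{\pi}\!\left[\exp\!\Big(\beta \textstyle\sum_{h=1}^{H} r_h\Big) \,\Big|\, s_1\right],\\
  \mathrm{Regret}(T) &= \sum_{k=1}^{K}
    \Big( V^{*}(s_1^{k}) - V^{\pi^{k}}(s_1^{k}) \Big), \qquad T = KH.
\end{align*}
% Here \beta > 0 corresponds to risk-seeking and \beta < 0 to risk-averse
% behavior, and the limit \beta -> 0 recovers the risk-neutral expected return.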

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-hu23b,
  title     = {A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning},
  author    = {Hu, Xiaoyan and Leung, Ho-Fung},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {5411--5437},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/hu23b/hu23b.pdf},
  url       = {https://proceedings.mlr.press/v206/hu23b.html}
}
APA
Hu, X. & Leung, H. (2023). A Tighter Problem-Dependent Regret Bound for Risk-Sensitive Reinforcement Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:5411-5437. Available from https://proceedings.mlr.press/v206/hu23b.html.
