Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3204-3245, 2022.
Abstract
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP that significantly improve over the only existing results of (Vial et al., 2021). Our first algorithm is computationally efficient and achieves a regret bound $O(\sqrt{d^3 B_{\star}^2 T_{\star} K})$, where $d$ is the dimension of the feature space, $B_{\star}$ and $T_{\star}$ are upper bounds of the expected costs and hitting time of the optimal policy respectively, and $K$ is the number of episodes. The same algorithm with a slight modification also achieves logarithmic regret of order $O\big(\frac{d^3 B_{\star}^4}{c_{\min}^2\,\mathrm{gap}_{\min}} \ln^5 \frac{d B_{\star} K}{c_{\min}}\big)$, where $\mathrm{gap}_{\min}$ is the minimum sub-optimality gap and $c_{\min}$ is the minimum cost over all state-action pairs. Our result is obtained by developing a simpler and improved analysis for the finite-horizon approximation of (Cohen et al., 2021) with a smaller approximation error, which might be of independent interest. On the other hand, using variance-aware confidence sets in a global optimization problem, our second algorithm is computationally inefficient but achieves the first “horizon-free” regret bound $O(d^{3.5} B_{\star} \sqrt{K})$ with no polynomial dependency on $T_{\star}$ or $1/c_{\min}$, almost matching the $\Omega(d B_{\star} \sqrt{K})$ lower bound from (Min et al., 2021).
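For reference, the two $\sqrt{K}$-type guarantees stated above and the lower bound they are compared against can be written side by side, where $R_K$ is shorthand for the regret over $K$ episodes and all other notation is as defined in the abstract:

\begin{align*}
  \text{Algorithm 1 (efficient):} \quad & R_K = O\!\big(\sqrt{d^3 B_{\star}^2 T_{\star} K}\big), \\
  \text{Algorithm 2 (horizon-free, inefficient):} \quad & R_K = O\!\big(d^{3.5} B_{\star} \sqrt{K}\big), \\
  \text{Lower bound (Min et al., 2021):} \quad & R_K = \Omega\!\big(d B_{\star} \sqrt{K}\big).
\end{align*}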