Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

Liyu Chen, Haipeng Luo, Chen-Yu Wei
Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:1180-1215, 2021.

Abstract

We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT_\star K})$ and $\widetilde{O}(\sqrt{DT_\star SA K})$ for the full-information setting and the bandit feedback setting, respectively, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the recent work of Rosenberg and Mansour (2020), which considers only the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework, with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias, respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
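For intuition only (this schematic is not taken from the paper, and all notation beyond the abstract's is illustrative): the Online Mirror Descent framework the algorithms build on maintains an occupancy measure $q_k$ over state-action pairs in each episode and updates it against an estimated cost vector $\hat{c}_k$, while regret is measured against the occupancy measure $q^{\pi^\star}$ of the best policy in hindsight. A generic form of the update over an occupancy-measure polytope $\Omega$, with learning rate $\eta$ and regularizer $\psi$, is:

\[
q_{k+1} = \operatorname*{argmin}_{q \in \Omega} \; \eta \langle q, \hat{c}_k \rangle + D_{\psi}(q, q_k),
\qquad
R_K = \sum_{k=1}^{K} \langle q_k - q^{\pi^\star}, c_k \rangle,
\]
where $D_{\psi}(q, q') = \psi(q) - \psi(q') - \langle \nabla\psi(q'), q - q' \rangle$ is the Bregman divergence induced by $\psi$. The paper's skewed occupancy measure space and correction term can be read as modifying $\Omega$ and $\hat{c}_k$, respectively, in this template.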

Cite this Paper

BibTeX
@InProceedings{pmlr-v134-chen21e,
  title     = {Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition},
  author    = {Chen, Liyu and Luo, Haipeng and Wei, Chen-Yu},
  booktitle = {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages     = {1180--1215},
  year      = {2021},
  editor    = {Belkin, Mikhail and Kpotufe, Samory},
  volume    = {134},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v134/chen21e/chen21e.pdf},
  url       = {https://proceedings.mlr.press/v134/chen21e.html}
}
APA
Chen, L., Luo, H. & Wei, C. (2021). Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:1180-1215. Available from https://proceedings.mlr.press/v134/chen21e.html.
