Dynamic Regret of Online Markov Decision Processes

Peng Zhao, Long-Fei Li, Zhi-Hua Zhou
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:26865-26894, 2022.

Abstract

We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We adopt dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. This measure is strictly stronger than the standard static regret, which benchmarks the learner's performance against a single fixed comparator policy. We consider three foundational models of online MDPs: episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For all three models, we propose novel online ensemble algorithms and establish their respective dynamic regret guarantees; the results for episodic (loop-free) SSP are provably minimax optimal in terms of the time horizon and a certain non-stationarity measure.
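
For concreteness, the measure can be written out as follows. This is a minimal LaTeX sketch using assumed notation (K episodes; \pi_k the learner's policy and \pi_k^c an arbitrary comparator policy in episode k; \ell_k(\pi) the expected loss of policy \pi under the adversarially chosen loss of episode k); the paper's own formulation, e.g. via occupancy measures, may differ in its details:

    % Dynamic regret against any sequence of feasible changing policies
    % (assumed notation, not taken verbatim from the paper)
    \mathrm{D\text{-}Regret}(\pi_1^c, \ldots, \pi_K^c)
      = \sum_{k=1}^{K} \ell_k(\pi_k) - \sum_{k=1}^{K} \ell_k(\pi_k^c)
    % Static regret is the special case with a fixed comparator,
    % \pi_1^c = \cdots = \pi_K^c = \pi^c, so dynamic regret is the
    % strictly stronger benchmark.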

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-zhao22c,
  title     = {Dynamic Regret of Online {M}arkov Decision Processes},
  author    = {Zhao, Peng and Li, Long-Fei and Zhou, Zhi-Hua},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {26865--26894},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/zhao22c/zhao22c.pdf},
  url       = {https://proceedings.mlr.press/v162/zhao22c.html}
}
Endnote
%0 Conference Paper
%T Dynamic Regret of Online Markov Decision Processes
%A Peng Zhao
%A Long-Fei Li
%A Zhi-Hua Zhou
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-zhao22c
%I PMLR
%P 26865--26894
%U https://proceedings.mlr.press/v162/zhao22c.html
%V 162
APA
Zhao, P., Li, L., & Zhou, Z. (2022). Dynamic Regret of Online Markov Decision Processes. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:26865-26894. Available from https://proceedings.mlr.press/v162/zhao22c.html.
