No-regret learning with high-probability in adversarial Markov decision processes

Mahsa Ghasemi, Abolfazl Hashemi, Haris Vikalo, Ufuk Topcu
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:992-1001, 2021.

Abstract

In a variety of problems, a decision-maker is unaware of the loss function associated with a task, yet must minimize this unknown loss in order to accomplish the task. Furthermore, the decision-maker’s task may evolve, resulting in a varying loss function. In this setting, we explore sequential decision-making problems modeled by adversarial Markov decision processes, where the loss function may change arbitrarily at every time step. We consider the bandit feedback scenario, in which the agent observes only the loss corresponding to its actions. We propose an algorithm, called online relative-entropy policy search with implicit exploration, that achieves sublinear regret not only in expectation but, more importantly, with high probability. In particular, we prove that by employing an optimistically biased loss estimator, the proposed algorithm achieves a regret of $\tilde{\mathcal{O}}((T|\mathcal{A}||\mathcal{S}|)^{\pp} \sqrt{\tau})$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, $\tau$ is the mixing time, and $T$ is the time horizon. To our knowledge, the proposed algorithm is the first scheme that enjoys such high-probability regret bounds for general adversarial Markov decision processes in the presence of bandit feedback.
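
As a rough illustration of the two ingredients the abstract names, the sketch below combines an implicit-exploration (IX) style loss estimator, which is optimistically biased because of the added gamma term in its denominator, with a relative-entropy (multiplicative-weights) step on the occupancy measure. The function names, array shapes, and the omission of the projection back onto occupancy measures consistent with the MDP dynamics are simplifying assumptions made for illustration; this is not the paper's exact algorithm.

import numpy as np

def ix_loss_estimate(observed_loss, s_t, a_t, occupancy, gamma):
    # Build a full |S| x |A| loss estimate from a single bandit observation.
    # occupancy[s, a] is the probability of visiting (s, a) under the current policy;
    # gamma > 0 biases the estimator downward (optimistically) and caps its variance.
    estimate = np.zeros_like(occupancy)
    estimate[s_t, a_t] = observed_loss / (occupancy[s_t, a_t] + gamma)
    return estimate

def reps_style_update(occupancy, loss_estimate, eta):
    # Exponential-weights step with learning rate eta; a full O-REPS update would
    # additionally project back onto the set of valid occupancy measures.
    unnormalized = occupancy * np.exp(-eta * loss_estimate)
    return unnormalized / unnormalized.sum()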

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-ghasemi21a,
  title     = {No-regret learning with high-probability in adversarial Markov decision processes},
  author    = {Ghasemi, Mahsa and Hashemi, Abolfazl and Vikalo, Haris and Topcu, Ufuk},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {992--1001},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/ghasemi21a/ghasemi21a.pdf},
  url       = {https://proceedings.mlr.press/v161/ghasemi21a.html},
  abstract  = {In a variety of problems, a decision-maker is unaware of the loss function associated with a task, yet it has to minimize this unknown loss in order to accomplish the task. Furthermore, the decision-maker’s task may evolve, resulting in a varying loss function. In this setting, we explore sequential decision-making problems modeled by adversarial Markov decision processes, where the loss function may arbitrarily change at every time step. We consider the bandit feedback scenario, where the agent observes only the loss corresponding to its actions. We propose an algorithm, called online relative-entropy policy search with implicit exploration, that achieves a sublinear regret not only in expectation but, more importantly, with high probability. In particular, we prove that by employing an optimistically biased loss estimator, the proposed algorithm achieves a regret of $\tilde{\mathcal{O}}((T|\mathcal{A}||\mathcal{S}|)^{\pp} \sqrt{\tau})$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, $\tau$ is the mixing time, and $T$ is the time horizon. To our knowledge, the proposed algorithm is the first scheme that enjoys such high-probability regret bounds for general adversarial Markov decision processes under the presence of bandit feedback.}
}
Endnote
%0 Conference Paper
%T No-regret learning with high-probability in adversarial Markov decision processes
%A Mahsa Ghasemi
%A Abolfazl Hashemi
%A Haris Vikalo
%A Ufuk Topcu
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-ghasemi21a
%I PMLR
%P 992--1001
%U https://proceedings.mlr.press/v161/ghasemi21a.html
%V 161
%X In a variety of problems, a decision-maker is unaware of the loss function associated with a task, yet it has to minimize this unknown loss in order to accomplish the task. Furthermore, the decision-maker’s task may evolve, resulting in a varying loss function. In this setting, we explore sequential decision-making problems modeled by adversarial Markov decision processes, where the loss function may arbitrarily change at every time step. We consider the bandit feedback scenario, where the agent observes only the loss corresponding to its actions. We propose an algorithm, called online relative-entropy policy search with implicit exploration, that achieves a sublinear regret not only in expectation but, more importantly, with high probability. In particular, we prove that by employing an optimistically biased loss estimator, the proposed algorithm achieves a regret of $\tilde{\mathcal{O}}((T|\mathcal{A}||\mathcal{S}|)^{\pp} \sqrt{\tau})$, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, $\tau$ is the mixing time, and $T$ is the time horizon. To our knowledge, the proposed algorithm is the first scheme that enjoys such high-probability regret bounds for general adversarial Markov decision processes under the presence of bandit feedback.
APA
Ghasemi, M., Hashemi, A., Vikalo, H. & Topcu, U. (2021). No-regret learning with high-probability in adversarial Markov decision processes. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:992-1001. Available from https://proceedings.mlr.press/v161/ghasemi21a.html.

Related Material