No-regret learning with high-probability in adversarial Markov decision processes
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:992-1001, 2021.
Abstract
In a variety of problems, a decision-maker is unaware of the loss function associated with a task, yet must minimize this unknown loss to accomplish the task. Furthermore, the decision-maker’s task may evolve over time, resulting in a varying loss function. In this setting, we explore sequential decision-making problems modeled as adversarial Markov decision processes, where the loss function may change arbitrarily at every time step. We consider the bandit feedback scenario, in which the agent observes only the loss corresponding to its own actions. We propose an algorithm, called online relative-entropy policy search with implicit exploration, that achieves sublinear regret not only in expectation but, more importantly, with high probability. In particular, we prove that by employing an optimistically biased loss estimator, the proposed algorithm achieves a regret of $\tilde{\mathcal{O}}((T|A||S|)^{\pp} \sqrt{\tau})$, where $|S|$ is the number of states, $|A|$ is the number of actions, $\tau$ is the mixing time, and $T$ is the time horizon. To our knowledge, the proposed algorithm is the first scheme that enjoys such high-probability regret bounds for general adversarial Markov decision processes in the presence of bandit feedback.
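As an illustration of what an optimistically biased loss estimator can look like, the display below sketches the standard implicit-exploration (IX) estimator from the bandit literature; the estimator actually used by the proposed algorithm may differ in its details, so this should be read as a hedged sketch rather than the paper's definition. Here $q_t(s,a)$ denotes the probability, under the current policy, of visiting state-action pair $(s,a)$ at time $t$, $(s_t,a_t)$ is the pair actually visited, $\ell_t(s,a) \in [0,1]$ is the adversarially chosen loss, and $\gamma > 0$ is the implicit-exploration parameter; these symbols are assumptions of the sketch, not notation taken from the abstract.

$$\hat{\ell}_t(s,a) \;=\; \frac{\ell_t(s,a)}{q_t(s,a) + \gamma}\, \mathbb{1}\{(s_t,a_t) = (s,a)\}.$$

Because of the extra $\gamma$ in the denominator, $\mathbb{E}\big[\hat{\ell}_t(s,a)\big] \le \ell_t(s,a)$, so the estimator is biased downward (optimistically), and it is bounded above by $1/\gamma$; this control of the estimator's tail is what typically lets regret guarantees hold with high probability rather than only in expectation.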