The adversarial stochastic shortest path problem with unknown transition probabilities
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:805-813, 2012.
Abstract
We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse a finite directed acyclic graph with random transitions while maximizing the rewards obtained along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode. Previous work was concerned with the case when the stochastic dynamics is known ahead of time, whereas the main novelty of this paper is that this assumption is lifted. We propose an algorithm called “follow the perturbed optimistic policy” that combines ideas from the “follow the perturbed leader” method for online learning of arbitrary sequences and “upper confidence reinforcement learning”, an algorithm for regret minimization in Markovian decision processes (with a fixed reward function). We prove that the expected cumulative regret of our algorithm is of order L|X||A|√T up to logarithmic factors, where L is the length of the longest path in the graph, X is the state space, A is the action space, and T is the number of episodes. To our knowledge, this is the first algorithm that learns and controls stochastic and adversarial components in an online fashion at the same time.
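The algorithm sketched in the abstract combines two ingredients: optimistic planning under estimated transition probabilities (as in upper confidence reinforcement learning) and perturbed-leader-style randomization of the accumulated rewards. The following is a minimal illustrative sketch, not the paper's construction: the layered-DAG representation, the exponential perturbation scale, the form of the confidence bonus, and all names (n_layers, n_states, n_actions, optimistic_policy, etc.) are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)
n_layers, n_states, n_actions = 4, 3, 2   # assumed layered graph: L layers, |X| states per layer, |A| actions
T = 100                                   # number of episodes

# Empirical transition statistics: counts of (layer, state, action) -> next state.
counts = np.zeros((n_layers - 1, n_states, n_actions, n_states))
visits = np.zeros((n_layers - 1, n_states, n_actions))

# Cumulative rewards per (layer, state, action), revealed at the end of each episode.
cum_reward = np.zeros((n_layers - 1, n_states, n_actions))

def optimistic_policy(perturbed_reward):
    """Backward dynamic programming over the loop-free graph, using empirical
    transition estimates plus a crude confidence bonus (assumed form)."""
    value = np.zeros(n_states)                       # terminal-layer values
    policy = np.zeros((n_layers - 1, n_states), dtype=int)
    for l in reversed(range(n_layers - 1)):
        q = np.zeros((n_states, n_actions))
        for x in range(n_states):
            for a in range(n_actions):
                n = max(visits[l, x, a], 1.0)
                p_hat = counts[l, x, a] / n          # empirical transition distribution
                bonus = np.sqrt(1.0 / n)             # assumed confidence width
                q[x, a] = perturbed_reward[l, x, a] + p_hat @ value + bonus * value.max()
        policy[l] = q.argmax(axis=1)
        value = q.max(axis=1)
    return policy

for t in range(T):
    # Follow-the-perturbed-leader step: perturb the cumulative rewards before planning.
    perturbation = rng.exponential(scale=np.sqrt(t + 1), size=cum_reward.shape)
    policy = optimistic_policy(cum_reward + perturbation)
    # ... execute `policy` for one episode, then update counts, visits, and
    # cum_reward from the fully revealed reward function (environment omitted here).

The sketch only conveys the shape of the method (perturb, plan optimistically, act, update); the actual algorithm and its regret analysis are given in the paper.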