The adversarial stochastic shortest path problem with unknown transition probabilities
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:805-813, 2012.
Abstract
We consider online learning in a special class of episodic Markovian decision processes, namely, loop-free stochastic shortest path problems. In this problem, an agent has to traverse a finite directed acyclic graph with random transitions while maximizing the rewards obtained along the way. We assume that the reward function can change arbitrarily between consecutive episodes, and is entirely revealed to the agent at the end of each episode. Previous work was concerned with the case when the stochastic dynamics is known ahead of time, whereas the main novelty of this paper is that this assumption is lifted. We propose an algorithm called “follow the perturbed optimistic policy” that combines ideas from the “follow the perturbed leader” method for online learning of arbitrary sequences and “upper confidence reinforcement learning”, an algorithm for regret minimization in Markovian decision processes (with a fixed reward function). We prove that the expected cumulative regret of our algorithm is of order L|X||A|√T up to logarithmic factors, where L is the length of the longest path in the graph, X is the state space, A is the action space, and T is the number of episodes. To our knowledge, this is the first algorithm that learns and controls stochastic and adversarial components in an online fashion at the same time.
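The algorithm sketched in the abstract combines two ingredients: optimistic planning under estimated transition probabilities (as in upper confidence reinforcement learning) and perturbed-leader-style randomization of the accumulated rewards. The following is a minimal illustrative sketch, not the paper's construction: the layered-DAG representation, the exponential perturbation scale, the form of the confidence bonus, and all names (n_layers, n_states, n_actions, optimistic_policy, etc.) are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)
n_layers, n_states, n_actions = 4, 3, 2   # assumed layered graph: L layers, |X| states per layer, |A| actions
T = 100                                   # number of episodes

# Empirical transition statistics: counts of (layer, state, action) -> next state.
counts = np.zeros((n_layers - 1, n_states, n_actions, n_states))
visits = np.zeros((n_layers - 1, n_states, n_actions))

# Cumulative rewards per (layer, state, action), revealed at the end of each episode.
cum_reward = np.zeros((n_layers - 1, n_states, n_actions))

def optimistic_policy(perturbed_reward):
    """Backward dynamic programming over the loop-free graph, using empirical
    transition estimates plus a crude confidence bonus (assumed form)."""
    value = np.zeros(n_states)                       # terminal-layer values
    policy = np.zeros((n_layers - 1, n_states), dtype=int)
    for l in reversed(range(n_layers - 1)):
        q = np.zeros((n_states, n_actions))
        for x in range(n_states):
            for a in range(n_actions):
                n = max(visits[l, x, a], 1.0)
                p_hat = counts[l, x, a] / n          # empirical transition distribution
                bonus = np.sqrt(1.0 / n)             # assumed confidence width
                q[x, a] = perturbed_reward[l, x, a] + p_hat @ value + bonus * value.max()
        policy[l] = q.argmax(axis=1)
        value = q.max(axis=1)
    return policy

for t in range(T):
    # Follow-the-perturbed-leader step: perturb the cumulative rewards before planning.
    perturbation = rng.exponential(scale=np.sqrt(t + 1), size=cum_reward.shape)
    policy = optimistic_policy(cum_reward + perturbation)
    # ... execute `policy` for one episode, then update counts, visits, and
    # cum_reward from the fully revealed reward function (environment omitted here).

The sketch only conveys the shape of the method (perturb, plan optimistically, act, update); the actual algorithm and its regret analysis are given in the paper.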