Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Michael Castronovo; Francis Maes; Raphael Fonteneau; Damien Ernst

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning

Michael Castronovo, Francis Maes, Raphael Fonteneau, Damien Ernst

Proceedings of the Tenth European Workshop on Reinforcement Learning, PMLR 24:1-10, 2013.

Abstract

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution

$p_\mathcal{M}(\cdot)$ . The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to

$p_\mathcal{M}(\cdot)$ . As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions and the optimal value functions

$\hat{V}$ and

$\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with

$\epsilon$ -Greedy strategies and the results are promising.

Cite this Paper

BibTeX


@InProceedings{pmlr-v24-castronovo12a,
  title = 	 {Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning},
  author = 	 {Castronovo, Michael and Maes, Francis and Fonteneau, Raphael and Ernst, Damien},
  booktitle = 	 {Proceedings of the Tenth European Workshop on Reinforcement Learning},
  pages = 	 {1--10},
  year = 	 {2013},
  editor = 	 {Deisenroth, Marc Peter and Szepesvári, Csaba and Peters, Jan},
  volume = 	 {24},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Edinburgh, Scotland},
  month = 	 {30 Jun--01 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v24/castronovo12a/castronovo12a.pdf},
  url = 	 {https://proceedings.mlr.press/v24/castronovo12a.html},
  abstract = 	 {We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.}
}

Endnote

%0 Conference Paper
%T Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
%A Michael Castronovo
%A Francis Maes
%A Raphael Fonteneau
%A Damien Ernst
%B Proceedings of the Tenth European Workshop on Reinforcement Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Marc Peter Deisenroth
%E Csaba Szepesvári
%E Jan Peters	
%F pmlr-v24-castronovo12a
%I PMLR
%P 1--10
%U https://proceedings.mlr.press/v24/castronovo12a.html
%V 24
%X We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.

RIS


TY  - CPAPER
TI  - Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
AU  - Michael Castronovo
AU  - Francis Maes
AU  - Raphael Fonteneau
AU  - Damien Ernst
BT  - Proceedings of the Tenth European Workshop on Reinforcement Learning
DA  - 2013/01/12
ED  - Marc Peter Deisenroth
ED  - Csaba Szepesvári
ED  - Jan Peters	
ID  - pmlr-v24-castronovo12a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 24
SP  - 1
EP  - 10
L1  - http://proceedings.mlr.press/v24/castronovo12a/castronovo12a.pdf
UR  - https://proceedings.mlr.press/v24/castronovo12a.html
AB  - We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.
ER  -

APA


Castronovo, M., Maes, F., Fonteneau, R. & Ernst, D.. (2013). Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning. Proceedings of the Tenth European Workshop on Reinforcement Learning, in Proceedings of Machine Learning Research 24:1-10 Available from https://proceedings.mlr.press/v24/castronovo12a.html.

Related Material

Download PDF