- title: 'Preface'
abstract: 'Preface to the Proceedings of the Tenth European Workshop on Reinforcement Learning June, 2012, Edinburgh, Scotland.'
volume: 24
URL: https://proceedings.mlr.press/v24/deisenroth12a.html
PDF: http://proceedings.mlr.press/v24/deisenroth12a/deisenroth12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-deisenroth12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: i-i
id: deisenroth12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: i
lastpage: i
published: 2013-01-12 00:00:00 +0000
- title: 'Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning'
abstract: 'We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.'
volume: 24
URL: https://proceedings.mlr.press/v24/castronovo12a.html
PDF: http://proceedings.mlr.press/v24/castronovo12a/castronovo12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-castronovo12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Michael
family: Castronovo
- given: Francis
family: Maes
- given: Raphael
family: Fonteneau
- given: Damien
family: Ernst
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 1-10
id: castronovo12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 1
lastpage: 10
published: 2013-01-12 00:00:00 +0000
- title: 'Feature Reinforcement Learning using Looping Suffix Trees'
abstract: 'There has recently been much interest in history-based methods using suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments that have long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees which have previously only been used in solving deterministic POMDPs. The resulting algorithm replicates results from CTΦMDP for environments with short-term dependencies, while it outperforms LSTM-based methods on TMaze, a deep memory environment.'
volume: 24
URL: https://proceedings.mlr.press/v24/daswani12a.html
PDF: http://proceedings.mlr.press/v24/daswani12a/daswani12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-daswani12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Mayank
family: Daswani
- given: Peter
family: Sunehag
- given: Marcus
family: Hutter
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 11-24
id: daswani12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 11
lastpage: 24
published: 2013-01-12 00:00:00 +0000
- title: 'Planning in Reward-Rich Domains via PAC Bandits'
abstract: 'In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold $r_0$. We present several algorithms and use them to identify reliable strategies for solving screens from the video games \emph{Infinite Mario} and \emph{Pitfall!} We show order-of-magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.'
volume: 24
URL: https://proceedings.mlr.press/v24/goschin12a.html
PDF: http://proceedings.mlr.press/v24/goschin12a/goschin12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-goschin12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Sergiu
family: Goschin
- given: Ari
family: Weinstein
- given: Michael L.
family: Littman
- given: Erick
family: Chastain
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 25-42
id: goschin12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 25
lastpage: 42
published: 2013-01-12 00:00:00 +0000
- title: 'Actor-Critic Reinforcement Learning with Energy-Based Policies'
abstract: 'We consider reinforcement learning in Markov decision processes with high dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high dimensional domains.'
volume: 24
URL: https://proceedings.mlr.press/v24/heess12a.html
PDF: http://proceedings.mlr.press/v24/heess12a/heess12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-heess12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Nicolas
family: Heess
- given: David
family: Silver
- given: Yee Whye
family: Teh
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 45-58
id: heess12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 45
lastpage: 58
published: 2013-01-12 00:00:00 +0000
- title: 'Directed Exploration in Reinforcement Learning with Transferred Knowledge'
abstract: 'Experimental results suggest that transfer learning (TL), compared to learning from scratch, can decrease exploration by reinforcement learning (RL) algorithms. Most existing TL algorithms for RL are heuristic and may result in worse performance than learning from scratch (i.e., negative transfer). We introduce a theoretically grounded and flexible approach that transfers action-values via an intertask mapping and, based on those, explores the target task systematically. We characterize positive transfer as (1) decreasing sample complexity in the target task compared to the sample complexity of the base RL algorithm (without transferred action-values) and (2) guaranteeing that the algorithm converges to a near-optimal policy (i.e., negligible optimality loss). The sample complexity of our approach is no worse than the base algorithm''s, and our analysis reveals that positive transfer can occur even with highly inaccurate and partial intertask mappings. Finally, we empirically test directed exploration with transfer in a multijoint reaching task, which highlights the value of our analysis and the robustness of our approach under imperfect conditions.'
volume: 24
URL: https://proceedings.mlr.press/v24/mann12a.html
PDF: http://proceedings.mlr.press/v24/mann12a/mann12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-mann12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Timothy A.
family: Mann
- given: Yoonsuck
family: Choe
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 59-76
id: mann12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 59
lastpage: 76
published: 2013-01-12 00:00:00 +0000
- title: 'Online Skill Discovery using Graph-based Clustering'
abstract: 'We introduce a new online skill discovery method for reinforcement learning in discrete domains. The method is based on the bottleneck principle and identifies skills using a bottom-up hierarchical clustering of the estimated transition graph. In contrast to prior clustering approaches, it can be used incrementally and thus several times during the learning process. Our empirical evaluation shows that “assuming dense local connectivity in the face of uncertainty” can prevent premature identification of skills. Furthermore, we show that the choice of the linkage criterion is crucial for dealing with non-random sampling policies and stochastic environments.'
volume: 24
URL: https://proceedings.mlr.press/v24/metzen12a.html
PDF: http://proceedings.mlr.press/v24/metzen12a/metzen12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-metzen12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Jan Hendrik
family: Metzen
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 77-88
id: metzen12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 77
lastpage: 88
published: 2013-01-12 00:00:00 +0000
- title: 'An Empirical Analysis of Off-policy Learning in Discrete MDPs'
abstract: 'Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.'
volume: 24
URL: https://proceedings.mlr.press/v24/paduraru12a.html
PDF: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-paduraru12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Cosmin
family: Păduraru
- given: Doina
family: Precup
- given: Joelle
family: Pineau
- given: Gheorghe
family: Comănici
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 89-102
id: paduraru12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 89
lastpage: 102
published: 2013-01-12 00:00:00 +0000
- title: 'Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments'
abstract: 'EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there was an increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multiarmed bandits with side information [Seldin et al., 2011] and to multiarmed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. We present an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving “logarithmic” regret in stochastic environments.'
volume: 24
URL: https://proceedings.mlr.press/v24/seldin12a.html
PDF: http://proceedings.mlr.press/v24/seldin12a/seldin12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-seldin12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Yevgeny
family: Seldin
- given: Csaba
family: Szepesvári
- given: Peter
family: Auer
- given: Yasin
family: Abbasi-Yadkori
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 103-116
id: seldin12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 103
lastpage: 116
published: 2013-01-12 00:00:00 +0000
- title: 'Gradient Temporal Difference Networks'
abstract: 'Temporal-difference (TD) networks (Sutton and Tanner, 2004) are a predictive representation of state in which each node is an answer to a question about future observations or questions. Unfortunately, existing algorithms for learning TD networks are known to diverge, even in very simple problems. In this paper we present the first sound learning rule for TD networks. Our approach is to develop a true gradient descent algorithm that takes account of all three roles performed by each node in the network: as state, as an answer, and as a target for other questions. Our algorithm combines gradient temporal-difference learning (Maei et al., 2009) with real-time recurrent learning (Williams and Zipser, 1994). We provide a generalisation of the Bellman equation that corresponds to the semantics of the TD network, and prove that our algorithm converges to a fixed point of this equation.'
volume: 24
URL: https://proceedings.mlr.press/v24/silver12a.html
PDF: http://proceedings.mlr.press/v24/silver12a/silver12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-silver12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: David
family: Silver
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 117-130
id: silver12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 117
lastpage: 130
published: 2013-01-12 00:00:00 +0000
- title: 'Semi-Supervised Apprenticeship Learning'
abstract: 'In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts'' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.'
volume: 24
URL: https://proceedings.mlr.press/v24/valko12a.html
PDF: http://proceedings.mlr.press/v24/valko12a/valko12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-valko12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Michal
family: Valko
- given: Mohammad
family: Ghavamzadeh
- given: Alessandro
family: Lazaric
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 131-142
id: valko12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 131
lastpage: 142
published: 2013-01-12 00:00:00 +0000
- title: 'An investigation of imitation learning algorithms for structured prediction'
abstract: 'In the imitation learning paradigm algorithms learn from expert demonstrations in order to become able to accomplish a particular task. Daumé III et al. [2009] framed structured prediction in this paradigm and developed the search-based structured prediction algorithm (Searn) which has been applied successfully to various natural language processing tasks with state-of-the-art performance. Recently, Ross et al. [2011] proposed the dataset aggregation algorithm (DAgger) and compared it with Searn in sequential prediction tasks. In this paper, we compare these two algorithms in the context of a more complex structured prediction task, namely biomedical event extraction. We demonstrate that DAgger has more stable performance and faster learning than Searn, and that these advantages are more pronounced in the parameter-free versions of the algorithms.'
volume: 24
URL: https://proceedings.mlr.press/v24/vlachos12a.html
PDF: http://proceedings.mlr.press/v24/vlachos12a/vlachos12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-vlachos12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Andreas
family: Vlachos
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 143-154
id: vlachos12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 143
lastpage: 154
published: 2013-01-12 00:00:00 +0000
- title: 'Rollout-based Game-tree Search Outprunes Traditional Alpha-beta'
abstract: 'Recently, rollout-based planning and search methods have emerged as an alternative to traditional tree-search methods. The fundamental operation in rollout-based tree search is the generation of trajectories in the search tree from root to leaf. Game-playing programs based on Monte-Carlo rollout methods such as “UCT” have proven remarkably effective at using information from trajectories to make state-of-the-art decisions at the root. In this paper, we show that trajectories can be used to prune more aggressively than classical alpha-beta search. We modify a rollout-based method, FSSS, to allow for use in game-tree search and show it outprunes alpha-beta both empirically and formally.'
volume: 24
URL: https://proceedings.mlr.press/v24/weinstein12a.html
PDF: http://proceedings.mlr.press/v24/weinstein12a/weinstein12a.pdf
edit: https://github.com/mlresearch//v24/edit/gh-pages/_posts/2013-01-12-weinstein12a.md
series: 'Proceedings of Machine Learning Research'
container-title: 'Proceedings of the Tenth European Workshop on Reinforcement Learning'
publisher: 'PMLR'
author:
- given: Ari
family: Weinstein
- given: Michael L.
family: Littman
- given: Sergiu
family: Goschin
editor:
- given: Marc Peter
family: Deisenroth
- given: Csaba
family: Szepesvári
- given: Jan
family: Peters
address: Edinburgh, Scotland
page: 155-167
id: weinstein12a
issued:
date-parts:
- 2013
- 1
- 12
firstpage: 155
lastpage: 167
published: 2013-01-12 00:00:00 +0000