Proceedings of Machine Learning Research
Proceedings of the Tenth European Workshop on Reinforcement Learning
Held in Edinburgh, Scotland on 30 June to 01 July 2012
Published as Volume 24 by the Proceedings of Machine Learning Research on 12 January 2013.
Volume Edited by:
Marc Peter Deisenroth
Csaba Szepesvári
Jan Peters
Series Editors:
Neil D. Lawrence
http://proceedings.mlr.press/v24/
Sat, 21 Nov 2020 21:17:22 +0000

Rollout-based Game-tree Search Outprunes Traditional Alpha-beta
Recently, rollout-based planning and search methods have emerged as an alternative to traditional tree-search methods. The fundamental operation in rollout-based tree search is the generation of trajectories in the search tree from root to leaf. Game-playing programs based on Monte-Carlo rollout methods such as “UCT” have proven remarkably effective at using information from trajectories to make state-of-the-art decisions at the root. In this paper, we show that trajectories can be used to prune more aggressively than classical alpha-beta search. We modify a rollout-based method, FSSS, to allow for use in game-tree search and show it outprunes alpha-beta both empirically and formally.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/weinstein12a.html
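For reference, the classical baseline named in the abstract above can be sketched as follows. This is a minimal illustration of textbook alpha-beta pruning, not the paper's FSSS-based method. The nested-list tree encoding (inner lists are positions, numbers are leaf values), the function name `alphabeta`, and the `visited` argument are assumptions made for this example.

```python
import math


def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True, visited=None):
    """Return the minimax value of `node`, skipping branches that cannot matter.

    Assumed encoding: a leaf is a number, an internal node is a list of children.
    If `visited` is a list, every evaluated leaf is appended to it.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node

    best = -math.inf if maximizing else math.inf
    for child in node:
        value = alphabeta(child, alpha, beta, not maximizing, visited)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        # Once the window is empty, the remaining children cannot change the result.
        if alpha >= beta:
            break
    return best


if __name__ == "__main__":
    leaves = []
    print(alphabeta([[3, 5], [2, 9]], visited=leaves), leaves)
```

In the two-level tree `[[3, 5], [2, 9]]` the leaf `9` is never evaluated: after the first subtree guarantees the maximizer a value of 3, seeing the leaf `2` in the second subtree is enough to discard it.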

An investigation of imitation learning algorithms for structured prediction
In the imitation learning paradigm, algorithms learn from expert demonstrations in order to accomplish a particular task. Daumé III et al. [2009] framed structured prediction in this paradigm and developed the search-based structured prediction algorithm (Searn), which has been applied successfully to various natural language processing tasks with state-of-the-art performance. Recently, Ross et al. [2011] proposed the dataset aggregation algorithm (DAgger) and compared it with Searn in sequential prediction tasks. In this paper, we compare these two algorithms in the context of a more complex structured prediction task, namely biomedical event extraction. We demonstrate that DAgger has more stable performance and faster learning than Searn, and that these advantages are more pronounced in the parameter-free versions of the algorithms.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/vlachos12a.html

Semi-Supervised Apprenticeship Learning
In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two gridworld domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/valko12a.html

Gradient Temporal Difference Networks
Temporal-difference (TD) networks (Sutton and Tanner, 2004) are a predictive representation of state in which each node is an answer to a question about future observations or questions. Unfortunately, existing algorithms for learning TD networks are known to diverge, even in very simple problems. In this paper we present the first sound learning rule for TD networks. Our approach is to develop a true gradient descent algorithm that takes account of all three roles performed by each node in the network: as state, as an answer, and as a target for other questions. Our algorithm combines gradient temporal-difference learning (Maei et al., 2009) with real-time recurrent learning (Williams and Zipser, 1994). We provide a generalisation of the Bellman equation that corresponds to the semantics of the TD network, and prove that our algorithm converges to a fixed point of this equation.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/silver12a.html

Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments
EXP3 is a popular algorithm for adversarial multi-armed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there has been increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multi-armed bandits with side information [Seldin et al., 2011] and to multi-armed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. We present an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving “logarithmic” regret in stochastic environments.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/seldin12a.html
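To make the setting of the abstract above concrete, here is a minimal sketch of the standard EXP3 update run against stochastic (Bernoulli) arms. It shows only the textbook algorithm with a fixed exploration rate; it is not the modified variant the paper proposes. The function names, the `gamma` default, the weight rescaling step, and the Bernoulli environment are assumptions made for this example.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration at rate gamma."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def run_exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Play EXP3 for n_rounds and return how often each arm was pulled.

    reward_fn(arm, rng) is assumed to return a reward in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Importance-weighted reward estimate; only the pulled arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale to avoid overflow; the probabilities depend only on weight ratios.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1
    return pulls


def bernoulli_arms(means):
    """Hypothetical stochastic environment: arm i pays 1 with probability means[i]."""
    def reward_fn(arm, rng):
        return 1.0 if rng.random() < means[arm] else 0.0
    return reward_fn


if __name__ == "__main__":
    print(run_exp3(bernoulli_arms([0.1, 0.9]), n_arms=2, n_rounds=2000))
```

Because the uniform mixing term keeps every arm's probability at or above `gamma / n_arms`, this plain version keeps pulling poor arms at a constant rate however long it runs.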

An Empirical Analysis of Off-policy Learning in Discrete MDPs
Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy. While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that unnormalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/paduraru12a.html
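The unnormalized importance sampling estimator mentioned in the abstract above can be sketched as follows, assuming undiscounted finite-horizon returns. The dict-of-dicts policy representation, the `(state, action, reward)` trajectory format, and the function names are assumptions made for this example and are not taken from the paper.

```python
def trajectory_weight(trajectory, target_policy, behaviour_policy):
    """Product over time steps of target / behaviour action probabilities.

    Policies are assumed to be dicts: policy[state][action] -> probability.
    """
    weight = 1.0
    for state, action, _reward in trajectory:
        weight *= target_policy[state][action] / behaviour_policy[state][action]
    return weight


def importance_sampling_estimate(trajectories, target_policy, behaviour_policy):
    """Unnormalized estimate of the target policy's expected undiscounted return.

    Each trajectory is a list of (state, action, reward) tuples collected
    under behaviour_policy.
    """
    total = 0.0
    for trajectory in trajectories:
        ret = sum(reward for _state, _action, reward in trajectory)
        total += trajectory_weight(trajectory, target_policy, behaviour_policy) * ret
    return total / len(trajectories)


if __name__ == "__main__":
    behaviour = {0: {0: 0.5, 1: 0.5}}   # uniform over two actions
    target = {0: {0: 0.0, 1: 1.0}}      # always takes action 1
    data = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
    print(importance_sampling_estimate(data, target, behaviour))
    # The weight of a trajectory that always matches the target doubles every step.
    print(trajectory_weight([(0, 1, 0.0)] * 10, target, behaviour))
```

The weight is a product with one factor per time step, so in this toy setup a fully matching trajectory of horizon H has weight 2^H while every other trajectory has weight 0. That spread is one way to see why variance can grow quickly with the horizon.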

Online Skill Discovery using Graph-based Clustering
We introduce a new online skill discovery method for reinforcement learning in discrete domains. The method is based on the bottleneck principle and identifies skills using a bottom-up hierarchical clustering of the estimated transition graph. In contrast to prior clustering approaches, it can be used incrementally and thus several times during the learning process. Our empirical evaluation shows that “assuming dense local connectivity in the face of uncertainty” can prevent premature identification of skills. Furthermore, we show that the choice of the linkage criterion is crucial for dealing with non-random sampling policies and stochastic environments.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/metzen12a.html

Directed Exploration in Reinforcement Learning with Transferred Knowledge
Experimental results suggest that transfer learning (TL), compared to learning from scratch, can decrease exploration by reinforcement learning (RL) algorithms. Most existing TL algorithms for RL are heuristic and may result in worse performance than learning from scratch (i.e., negative transfer). We introduce a theoretically grounded and flexible approach that transfers action-values via an inter-task mapping and, based on those, explores the target task systematically. We characterize positive transfer as (1) decreasing sample complexity in the target task compared to the sample complexity of the base RL algorithm (without transferred action-values) and (2) guaranteeing that the algorithm converges to a near-optimal policy (i.e., negligible optimality loss). The sample complexity of our approach is no worse than the base algorithm's, and our analysis reveals that positive transfer can occur even with highly inaccurate and partial inter-task mappings. Finally, we empirically test directed exploration with transfer in a multi-joint reaching task, which highlights the value of our analysis and the robustness of our approach under imperfect conditions.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/mann12a.html

Actor-Critic Reinforcement Learning with Energy-Based Policies
We consider reinforcement learning in Markov decision processes with high-dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high-dimensional domains.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/heess12a.html

Planning in Reward-Rich Domains via PAC Bandits
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold $r_0$. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We show order-of-magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/goschin12a.html

Preface
Preface to the Proceedings of the Tenth European Workshop on Reinforcement Learning, June 2012, Edinburgh, Scotland.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/deisenroth12a.html

Feature Reinforcement Learning using Looping Suffix Trees
There has recently been much interest in history-based methods using suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments that have long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees which have previously only been used in solving deterministic POMDPs. The resulting algorithm replicates results from CTΦMDP for environments with short-term dependencies, while it outperforms LSTM-based methods on T-Maze, a deep memory environment.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/daswani12a.html

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.
Sat, 12 Jan 2013 00:00:00 +0000
http://proceedings.mlr.press/v24/castronovo12a.html