Proceedings of Machine Learning Research

Proceedings of the Tenth European Workshop on Reinforcement Learning
Held in Edinburgh, Scotland on 30 June to 01 July 2012. Published as Volume 24 by the Proceedings of Machine Learning Research on 12 January 2013.
Volume Edited by: Marc Peter Deisenroth, Csaba Szepesvári, Jan Peters
Series Editors: Neil D. Lawrence
https://proceedings.mlr.press/v24/
Wed, 08 Feb 2023 10:39:08 +0000

Rollout-based Game-tree Search Outprunes Traditional Alpha-beta
Recently, rollout-based planning and search methods have emerged as an alternative to traditional tree-search methods. The fundamental operation in rollout-based tree search is the generation of trajectories in the search tree from root to leaf. Game-playing programs based on Monte-Carlo rollouts methods such as “UCT” have proven remarkably effective at using information from trajectories to make state-of-the-art decisions at the root. In this paper, we show that trajectories can be used to prune more aggressively than classical alpha-beta search. We modify a rollout-based method, FSSS, to allow for use in game-tree search and show it outprunes alpha-beta both empirically and formally.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/weinstein12a.html

An investigation of imitation learning algorithms for structured prediction
In the imitation learning paradigm algorithms learn from expert demonstrations in order to become able to accomplish a particular task. Daumé III et al. [2009] framed structured prediction in this paradigm and developed the search-based structured prediction algorithm (Searn) which has been applied successfully to various natural language processing tasks with state-of-the-art performance. Recently, Ross et al. [2011] proposed the dataset aggregation algorithm (DAgger) and compared it with Searn in sequential prediction tasks.
In this paper, we compare these two algorithms in the context of a more complex structured prediction task, namely biomedical event extraction. We demonstrate that DAgger has more stable performance and faster learning than Searn, and that these advantages are more pronounced in the parameter-free versions of the algorithms.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/vlachos12a.html

Semi-Supervised Apprenticeship Learning
In apprenticeship learning we aim to learn a good policy by observing the behavior of an expert or a set of experts. In particular, we consider the case where the expert acts so as to maximize an unknown reward function defined as a linear combination of a set of state features. In this paper, we consider the setting where we observe many sample trajectories (i.e., sequences of states) but only one or a few of them are labeled as experts' trajectories. We investigate the conditions under which the remaining unlabeled trajectories can help in learning a policy with a good performance. In particular, we define an extension to the max-margin inverse reinforcement learning proposed by Abbeel and Ng [2004] where, at each iteration, the max-margin optimization step is replaced by a semi-supervised optimization problem which favors classifiers separating clusters of trajectories. Finally, we report empirical results on two grid-world domains showing that the semi-supervised algorithm is able to output a better policy in fewer iterations than the related algorithm that does not take the unlabeled trajectories into account.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/valko12a.html

Gradient Temporal Difference Networks
Temporal-difference (TD) networks (Sutton and Tanner, 2004) are a predictive representation of state in which each node is an answer to a question about future observations or questions.
Unfortunately, existing algorithms for learning TD networks are known to diverge, even in very simple problems. In this paper we present the first sound learning rule for TD networks. Our approach is to develop a true gradient descent algorithm that takes account of all three roles performed by each node in the network: as state, as an answer, and as a target for other questions. Our algorithm combines gradient temporal-difference learning (Maei et al., 2009) with real-time recurrent learning (Williams and Zipser, 1994). We provide a generalisation of the Bellman equation that corresponds to the semantics of the TD network, and prove that our algorithm converges to a fixed point of this equation.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/silver12a.html

Evaluation and Analysis of the Performance of the EXP3 Algorithm in Stochastic Environments
EXP3 is a popular algorithm for adversarial multiarmed bandits, suggested and analyzed in this setting by Auer et al. [2002b]. Recently there was an increased interest in the performance of this algorithm in the stochastic setting, due to its new applications to stochastic multiarmed bandits with side information [Seldin et al., 2011] and to multiarmed bandits in the mixed stochastic-adversarial setting [Bubeck and Slivkins, 2012]. We present an empirical evaluation and improved analysis of the performance of the EXP3 algorithm in the stochastic setting, as well as a modification of the EXP3 algorithm capable of achieving “logarithmic” regret in stochastic environments.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/seldin12a.html

An Empirical Analysis of Off-policy Learning in Discrete MDPs
Off-policy evaluation is the problem of evaluating a decision-making policy using data collected under a different behaviour policy.
While several methods are available for addressing off-policy evaluation, little work has been done on identifying the best methods. In this paper, we conduct an in-depth comparative study of several off-policy evaluation methods in non-bandit, finite-horizon MDPs, using randomly generated MDPs, as well as a Mallard population dynamics model [Anderson, 1975]. We find that un-normalized importance sampling can exhibit prohibitively large variance in problems involving look-ahead longer than a few time steps, and that dynamic programming methods perform better than Monte-Carlo style methods.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/paduraru12a.html

Online Skill Discovery using Graph-based Clustering
We introduce a new online skill discovery method for reinforcement learning in discrete domains. The method is based on the bottleneck principle and identifies skills using a bottom-up hierarchical clustering of the estimated transition graph. In contrast to prior clustering approaches, it can be used incrementally and thus several times during the learning process. Our empirical evaluation shows that “assuming dense local connectivity in the face of uncertainty” can prevent premature identification of skills. Furthermore, we show that the choice of the linkage criterion is crucial for dealing with non-random sampling policies and stochastic environments.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/metzen12a.html

Directed Exploration in Reinforcement Learning with Transferred Knowledge
Experimental results suggest that transfer learning (TL), compared to learning from scratch, can decrease exploration by reinforcement learning (RL) algorithms. Most existing TL algorithms for RL are heuristic and may result in worse performance than learning from scratch (i.e., negative transfer).
We introduce a theoretically grounded and flexible approach that transfers action-values via an intertask mapping and, based on those, explores the target task systematically. We characterize positive transfer as (1) decreasing sample complexity in the target task compared to the sample complexity of the base RL algorithm (without transferred action-values) and (2) guaranteeing that the algorithm converges to a near-optimal policy (i.e., negligible optimality loss). The sample complexity of our approach is no worse than the base algorithm's, and our analysis reveals that positive transfer can occur even with highly inaccurate and partial intertask mappings. Finally, we empirically test directed exploration with transfer in a multijoint reaching task, which highlights the value of our analysis and the robustness of our approach under imperfect conditions.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/mann12a.html

Actor-Critic Reinforcement Learning with Energy-Based Policies
We consider reinforcement learning in Markov decision processes with high dimensional state and action spaces. We parametrize policies using energy-based models (particularly restricted Boltzmann machines), and train them using policy gradient learning. Our approach builds upon Sallans and Hinton (2004), who parameterized value functions using energy-based models, trained using a non-linear variant of temporal-difference (TD) learning. Unfortunately, non-linear TD is known to diverge in theory and practice. We introduce the first sound and efficient algorithm for training energy-based policies, based on an actor-critic architecture. Our algorithm is computationally efficient, converges close to a local optimum, and outperforms Sallans and Hinton (2004) in several high dimensional domains.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/heess12a.html

Planning in Reward-Rich Domains via PAC Bandits
In some decision-making environments, successful solutions are common. If the evaluation of candidate solutions is noisy, however, the challenge is knowing when a “good enough” answer has been found. We formalize this problem as an infinite-armed bandit and provide upper and lower bounds on the number of evaluations or “pulls” needed to identify a solution whose evaluation exceeds a given threshold $r_0$. We present several algorithms and use them to identify reliable strategies for solving screens from the video games Infinite Mario and Pitfall! We show order of magnitude improvements in sample complexity over a natural approach that pulls each arm until a good estimate of its success probability is known.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/goschin12a.html

Preface
Preface to the Proceedings of the Tenth European Workshop on Reinforcement Learning, June 2012, Edinburgh, Scotland.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/deisenroth12a.html

Feature Reinforcement Learning using Looping Suffix Trees
There has recently been much interest in history-based methods using suffix trees to solve POMDPs. However, these suffix trees cannot efficiently represent environments that have long-term dependencies. We extend the recently introduced CTΦMDP algorithm to the space of looping suffix trees which have previously only been used in solving deterministic POMDPs. The resulting algorithm replicates results from CTΦMDP for environments with short term dependencies, while it outperforms LSTM-based methods on TMaze, a deep memory environment.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/daswani12a.html

Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution $p_\mathcal{M}(\cdot)$. The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to $p_\mathcal{M}(\cdot)$. As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions $\hat{V}$ and $\hat{Q}$ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with $\epsilon$-Greedy strategies and the results are promising.
Sat, 12 Jan 2013 00:00:00 +0000 https://proceedings.mlr.press/v24/castronovo12a.html
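
As a rough illustration of two ingredients named in the last abstract above, the sketch below computes a discounted sum of rewards and selects actions with an $\epsilon$-Greedy rule. It is not the authors' implementation. The function names, the list of action values standing in for $\hat{Q}$ at one state, and the truncation of the infinite trajectory to a finite prefix are all assumptions made for this example.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite prefix of a trajectory.

    Assumption: the infinite-length criterion is approximated by
    truncating the trajectory to the rewards that were observed.
    """
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def epsilon_greedy_action(q_values, epsilon, rng):
    """Return an action index for one state.

    q_values is a hypothetical list of estimated action values.
    With probability epsilon a uniformly random action is returned,
    otherwise the first action with the highest estimated value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return q_values.index(max(q_values))


if __name__ == "__main__":
    rng = random.Random(0)
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
    print(epsilon_greedy_action([0.1, 0.9, 0.3], 0.1, rng))
```

Setting epsilon to 0 makes the rule purely greedy, and setting it to 1 makes it uniformly random, which is the trade-off that a learned E/E strategy is meant to improve on.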