Modelling Policies in MDPs in Reproducing Kernel Hilbert Space

[edit]

Guy Lever, Ronnie Stafford ;
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR 38:590-598, 2015.

Abstract

We consider modelling policies for MDPs in (vector-valued) reproducing kernel Hilbert function spaces (RKHS). This enables us to work “non-parametrically” in a rich function class, and provides the ability to learn complex policies. We present a framework for performing gradient-based policy optimization in the RKHS, deriving the functional gradient of the return for our policy, which has a simple form and can be estimated efficiently. The policy representation naturally focuses on the relevant region of state space defined by the policy trajectories, and does not rely on a-priori defined basis points; this can be an advantage in high dimensions where suitable basis points may be difficult to define a-priori. The method is adaptive in the sense that the policy representation will naturally adapt to the complexity of the policy being modelled, which is achieved with standard efficient sparsification tools in an RKHS. We argue that finding a good kernel on states can be easier then remetrizing a high dimensional feature space. We demonstrate the approach on benchmark domains and a simulated quadrocopter navigation task.

Related Material