Recurrent Predictive State Policy Networks

Ahmed Hefny, Zita Marinho, Wen Sun, Siddhartha Srinivasa, Geoffrey Gordon
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1949-1958, 2018.

Abstract

We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions in order to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz & Gordon, 2004; Sun et al., 2016) by modeling the predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2017) to initialize the recursive filter. A predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behavior. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient-based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks compare favorably with memory-preserving networks such as GRUs, as well as finite-memory models, and are the overall best-performing method.
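
The architecture described in the abstract can be summarized as a recursive filter that updates a predictive state from each action and observation, a reactive policy that maps the predictive state to an action, and a combined objective that trades off negative return against observation-prediction error. The following is a minimal conceptual sketch of that structure, not the authors' implementation: all dimensions, the environment interface `env`, and the trade-off weight `alpha` are illustrative assumptions, and in the paper the filter is initialized with a consistent PSR algorithm and refined with policy gradient plus prediction-error gradient descent.

```python
# Minimal, hypothetical sketch of an RPSP-style rollout with a combined
# reward / prediction-error objective. Shapes, `env`, and `alpha` are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs, d_act = 8, 4, 2          # hypothetical dimensions

# Recursive-filter parameters (in the paper, initialized with a consistent
# PSR algorithm and then refined by gradient descent).
W_f = rng.normal(scale=0.1, size=(d_state, d_state + d_obs + d_act))
# Reactive-policy parameters: predictive state -> action mean.
W_pi = rng.normal(scale=0.1, size=(d_act, d_state))
# Prediction head: predictive state and action -> predicted next observation.
W_pred = rng.normal(scale=0.1, size=(d_obs, d_state + d_act))

def filter_update(q, a, o):
    """Recursive filter: update the predictive state from the latest action and observation."""
    return np.tanh(W_f @ np.concatenate([q, o, a]))

def policy(q):
    """Reactive policy: noisy action centered on a linear map of the predictive state."""
    return W_pi @ q + 0.1 * rng.normal(size=d_act)

def rollout_loss(env, horizon=100, alpha=0.5):
    """Run one episode and return the combined loss (negative return + prediction error)."""
    q = np.zeros(d_state)                # initial predictive state
    env.reset()                          # hypothetical environment interface
    total_reward, pred_error = 0.0, 0.0
    for _ in range(horizon):
        a = policy(q)
        o_hat = W_pred @ np.concatenate([q, a])     # predicted next observation
        o_next, r, done = env.step(a)               # assumed (obs, reward, done) return
        total_reward += r
        pred_error += np.sum((o_hat - o_next) ** 2)
        q = filter_update(q, a, o_next)
        if done:
            break
    # Combined objective: maximize reward while minimizing prediction error;
    # the paper optimizes this with policy gradient plus gradient descent.
    return -total_reward + alpha * pred_error
```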

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-hefny18a,
  title = {Recurrent Predictive State Policy Networks},
  author = {Hefny, Ahmed and Marinho, Zita and Sun, Wen and Srinivasa, Siddhartha and Gordon, Geoffrey},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages = {1949--1958},
  year = {2018},
  editor = {Dy, Jennifer and Krause, Andreas},
  volume = {80},
  series = {Proceedings of Machine Learning Research},
  month = {10--15 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v80/hefny18a/hefny18a.pdf},
  url = {https://proceedings.mlr.press/v80/hefny18a.html},
  abstract = {We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions in order to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz & Gordon, 2004; Sun et al., 2016) by modeling the predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2017) to initialize the recursive filter. A predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behavior. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient-based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks compare favorably with memory-preserving networks such as GRUs, as well as finite-memory models, and are the overall best-performing method.}
}
Endnote
%0 Conference Paper
%T Recurrent Predictive State Policy Networks
%A Ahmed Hefny
%A Zita Marinho
%A Wen Sun
%A Siddhartha Srinivasa
%A Geoffrey Gordon
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-hefny18a
%I PMLR
%P 1949--1958
%U https://proceedings.mlr.press/v80/hefny18a.html
%V 80
%X We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions in order to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz & Gordon, 2004; Sun et al., 2016) by modeling the predictive state: a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2017) to initialize the recursive filter. A predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behavior. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient-based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks compare favorably with memory-preserving networks such as GRUs, as well as finite-memory models, and are the overall best-performing method.
APA
Hefny, A., Marinho, Z., Sun, W., Srinivasa, S. & Gordon, G. (2018). Recurrent Predictive State Policy Networks. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:1949-1958. Available from https://proceedings.mlr.press/v80/hefny18a.html.