Policy-Gradients for PSRs and POMDPs

Douglas Aberdeen; Olivier Buffet; Owen Thomas

Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2:3-10, 2007.

Abstract

In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
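As a rough illustration of the idea in the abstract, the sketch below tracks a POMDP belief state and feeds it to a parameterised policy whose log-probability gradient could drive a policy-gradient update. It is not the authors' implementation: the array layouts `T[a, s, s']` and `O[a, s', o]`, the linear-softmax policy, and the parameter matrix `theta` are all assumptions made for this example.

```python
import numpy as np


def belief_update(b, a, o, T, O):
    """Bayes-filter update of a POMDP belief state.

    Assumed layouts (hypothetical, chosen for this sketch):
      T[a, s, s2] = P(s2 | s, a)   transition model
      O[a, s2, o] = P(o | s2, a)   observation model
    """
    predicted = b @ T[a]                 # predict: sum_s b(s) P(s2 | s, a)
    unnorm = predicted * O[a, :, o]      # correct: weight by P(o | s2, a)
    total = unnorm.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnorm / total


def softmax_policy(theta, b):
    """Action probabilities from a linear-softmax policy over the belief."""
    logits = theta @ b
    logits = logits - logits.max()       # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def grad_log_policy(theta, b, a):
    """Gradient of log pi(a | b) with respect to theta (score function)."""
    p = softmax_policy(theta, b)
    onehot = np.zeros_like(p)
    onehot[a] = 1.0
    return np.outer(onehot - p, b)       # same shape as theta
```

In a REINFORCE-style learner, `grad_log_policy` would be accumulated along a trajectory and scaled by the observed reward; a PSR prediction vector could stand in for `b` as the sufficient statistic without changing the policy code.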

Cite this Paper

BibTeX


@InProceedings{pmlr-v2-aberdeen07a,
  title = 	 {Policy-Gradients for PSRs and POMDPs},
  author = 	 {Aberdeen, Douglas and Buffet, Olivier and Thomas, Owen},
  booktitle = 	 {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics},
  pages = 	 {3--10},
  year = 	 {2007},
  editor = 	 {Meila, Marina and Shen, Xiaotong},
  volume = 	 {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {San Juan, Puerto Rico},
  month = 	 {21--24 Mar},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v2/aberdeen07a/aberdeen07a.pdf},
  url = 	 {https://proceedings.mlr.press/v2/aberdeen07a.html},
  abstract = 	 {In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.}
}

Endnote

%0 Conference Paper
%T Policy-Gradients for PSRs and POMDPs
%A Douglas Aberdeen
%A Olivier Buffet
%A Owen Thomas
%B Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2007
%E Marina Meila
%E Xiaotong Shen
%F pmlr-v2-aberdeen07a
%I PMLR
%P 3--10
%U https://proceedings.mlr.press/v2/aberdeen07a.html
%V 2
%X In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.

RIS


TY  - CPAPER
TI  - Policy-Gradients for PSRs and POMDPs
AU  - Douglas Aberdeen
AU  - Olivier Buffet
AU  - Owen Thomas
BT  - Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
DA  - 2007/03/11
ED  - Marina Meila
ED  - Xiaotong Shen
ID  - pmlr-v2-aberdeen07a
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 2
SP  - 3
EP  - 10
L1  - http://proceedings.mlr.press/v2/aberdeen07a/aberdeen07a.pdf
UR  - https://proceedings.mlr.press/v2/aberdeen07a.html
AB  - In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
ER  -

APA


Aberdeen, D., Buffet, O. & Thomas, O. (2007). Policy-Gradients for PSRs and POMDPs. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 2:3-10. Available from https://proceedings.mlr.press/v2/aberdeen07a.html.
