Policy-Gradients for PSRs and POMDPs

Douglas Aberdeen, Olivier Buffet, Owen Thomas
Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, PMLR 2:3-10, 2007.

Abstract

In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
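As a rough illustration of what "tracking a sufficient statistic" means for the belief state mentioned in the abstract, the sketch below applies the standard Bayes-filter update for a discrete POMDP. It is not taken from the paper: the function name `belief_update`, the nested-list layout of the transition model `T` and observation model `O`, and the two-state toy model are all assumptions made for this example.

```python
from typing import List

# Hypothetical model layout (an assumption for this sketch):
#   T[a][s][s_next] = P(s_next | s, a)      transition probabilities
#   O[a][s_next][o] = P(o | s_next, a)      observation probabilities
Model = List[List[List[float]]]


def belief_update(belief: List[float], action: int, observation: int,
                  T: Model, O: Model) -> List[float]:
    """Standard discrete Bayes-filter step.

    b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s).
    """
    n = len(belief)
    unnormalised = []
    for s_next in range(n):
        # Prediction: probability of landing in s_next before seeing o.
        predicted = sum(T[action][s][s_next] * belief[s] for s in range(n))
        # Correction: weight by the likelihood of the observation.
        unnormalised.append(O[action][s_next][observation] * predicted)
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalised]


# Toy two-state model with a single action that leaves the state unchanged
# and emits the matching observation with probability 0.85.
T_TOY: Model = [[[1.0, 0.0], [0.0, 1.0]]]
O_TOY: Model = [[[0.85, 0.15], [0.15, 0.85]]]

if __name__ == "__main__":
    b = [0.5, 0.5]
    b = belief_update(b, action=0, observation=0, T=T_TOY, O=O_TOY)
    print(b)
```

The updated belief, rather than the full action-observation history, is what a parameterised policy would take as input. How that policy is trained by policy gradient is beyond this sketch.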

Cite this Paper


BibTeX
@InProceedings{pmlr-v2-aberdeen07a,
  title = {Policy-Gradients for PSRs and POMDPs},
  author = {Douglas Aberdeen and Olivier Buffet and Owen Thomas},
  booktitle = {Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics},
  pages = {3--10},
  year = {2007},
  editor = {Marina Meila and Xiaotong Shen},
  volume = {2},
  series = {Proceedings of Machine Learning Research},
  address = {San Juan, Puerto Rico},
  month = {21--24 Mar},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v2/aberdeen07a/aberdeen07a.pdf},
  url = {http://proceedings.mlr.press/v2/aberdeen07a.html},
  abstract = {In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.}
}
Endnote
%0 Conference Paper
%T Policy-Gradients for PSRs and POMDPs
%A Douglas Aberdeen
%A Olivier Buffet
%A Owen Thomas
%B Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2007
%E Marina Meila
%E Xiaotong Shen
%F pmlr-v2-aberdeen07a
%I PMLR
%J Proceedings of Machine Learning Research
%P 3--10
%U http://proceedings.mlr.press
%V 2
%W PMLR
%X In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
RIS
TY  - CPAPER
TI  - Policy-Gradients for PSRs and POMDPs
AU  - Douglas Aberdeen
AU  - Olivier Buffet
AU  - Owen Thomas
BT  - Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics
PY  - 2007/03/11
DA  - 2007/03/11
ED  - Marina Meila
ED  - Xiaotong Shen
ID  - pmlr-v2-aberdeen07a
PB  - PMLR
SP  - 3
DP  - PMLR
EP  - 10
L1  - http://proceedings.mlr.press/v2/aberdeen07a/aberdeen07a.pdf
UR  - http://proceedings.mlr.press/v2/aberdeen07a.html
AB  - In uncertain and partially observable environments control policies must be a function of the complete history of actions and observations. Rather than present an ever growing history to a learner, we instead track sufficient statistics of the history and map those to a control policy. The mapping has typically been done using dynamic programming, requiring large amounts of memory. We present a general approach to mapping sufficient statistics directly to control policies by combining the tracking of sufficient statistics with the use of policy-gradient reinforcement learning. The best known sufficient statistic is the belief state, computed from a known or estimated partially observable Markov decision process (POMDP) model. More recently, predictive state representations (PSRs) have emerged as a potentially compact model of partially observable systems. Our experiments explore the usefulness of both of these sufficient statistics, exact and estimated, in direct policy-search.
ER  -
APA
Aberdeen, D., Buffet, O., & Thomas, O. (2007). Policy-Gradients for PSRs and POMDPs. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, in PMLR 2:3-10.
