Average Reward Optimization Objective In Partially Observable Domains

Yuri Grinberg; Doina Precup

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):320-328, 2013.

Abstract

We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters. Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.
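The claim that the average reward is a rational function of the policy parameters can be seen in miniature on a toy example. The sketch below is not the paper's construction: it uses a fully observable two-state Markov chain rather than a PSR, and every name and number in it (`SWITCH`, `REWARD`, `theta`) is an assumption made up for illustration. It only shows why stationary quantities come out as ratios of polynomials in the policy parameter.

```python
from fractions import Fraction

# Hypothetical two-state chain; all numbers are illustrative, not from the paper.
# SWITCH[action][state] = probability of moving to the other state.
SWITCH = {
    0: (Fraction(1, 10), Fraction(1, 2)),
    1: (Fraction(3, 5), Fraction(1, 5)),
}
REWARD = (Fraction(0), Fraction(1))  # reward received in state 0 and in state 1


def switch_probs(theta):
    """Switch probabilities when action 0 is chosen with probability theta."""
    a = theta * SWITCH[0][0] + (1 - theta) * SWITCH[1][0]  # state 0 -> state 1
    b = theta * SWITCH[0][1] + (1 - theta) * SWITCH[1][1]  # state 1 -> state 0
    return a, b


def stationary(theta):
    """Stationary distribution of the two-state chain induced by theta."""
    a, b = switch_probs(theta)
    return (b / (a + b), a / (a + b))


def average_reward(theta):
    """Long-run average reward: expected reward under the stationary distribution."""
    pi = stationary(theta)
    return pi[0] * REWARD[0] + pi[1] * REWARD[1]


if __name__ == "__main__":
    for theta in (Fraction(0), Fraction(1, 2), Fraction(1)):
        print(theta, average_reward(theta))
```

For these particular numbers the switch probabilities are linear in `theta`, so the average reward simplifies to (6 - 5*theta) / (8 - 2*theta), a ratio of two degree-one polynomials. Exact `Fraction` arithmetic is used so that this identity can be checked without floating-point error.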

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-grinberg13,
  title = 	 {Average Reward Optimization Objective In Partially Observable Domains},
  author = 	 {Grinberg, Yuri and Precup, Doina},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {320--328},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/grinberg13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/grinberg13.html},
  abstract = 	 {We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters.  Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.}
}

Endnote

%0 Conference Paper
%T Average Reward Optimization Objective In Partially Observable Domains
%A Yuri Grinberg
%A Doina Precup
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-grinberg13
%I PMLR
%P 320--328
%U https://proceedings.mlr.press/v28/grinberg13.html
%V 28
%N 1
%X We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters.  Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.

RIS


TY  - CPAPER
TI  - Average Reward Optimization Objective In Partially Observable Domains
AU  - Yuri Grinberg
AU  - Doina Precup
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/02/13
ED  - Sanjoy Dasgupta
ED  - David McAllester
ID  - pmlr-v28-grinberg13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 1
SP  - 320
EP  - 328
L1  - http://proceedings.mlr.press/v28/grinberg13.pdf
UR  - https://proceedings.mlr.press/v28/grinberg13.html
AB  - We consider the problem of average reward optimization in domains with partial observability, within the modeling framework of linear predictive state representations (PSRs). The key to average-reward computation is to have a well-defined stationary behavior of a system, so the required averages can be computed. If, additionally, the stationary behavior varies smoothly with changes in policy parameters, average-reward control through policy search also becomes a possibility. In this paper, we show that PSRs have a well-behaved stationary distribution, which is a rational function of policy parameters.  Based on this result, we define a related reward process particularly suitable for average reward optimization, and analyze its properties. We show that in such a predictive state reward process, the average reward is a rational function of the policy parameters, whose complexity depends on the dimension of the underlying linear PSR. This result suggests that average reward-based policy search methods can be effective when the dimension of the system is small, even when the system representation in the POMDP framework requires many hidden states. We provide illustrative examples of this type.
ER  -

APA


Grinberg, Y. & Precup, D. (2013). Average Reward Optimization Objective In Partially Observable Domains. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(1):320-328. Available from https://proceedings.mlr.press/v28/grinberg13.html.
