On Average Reward Policy Evaluation in Infinite-State Partially Observable Systems
Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, PMLR 22:449-457, 2012.
We investigate the problem of estimating the average reward of given decision policies in discrete-time controllable dynamical systems with finite action and observation sets, but possibly infinite state space. Unlike in systems with finite state spaces, in infinite–state systems the expected reward for some policies might not exist, so policy evaluation, which is a key step in optimal control methods, might fail. Our main analysis tool is Ergodic theory, which allows learning potentially useful quantities from the system without building a model. Our main contribution is three-fold. First, we present several dynamical systems that demonstrate the difficulty of learning in the general case, without making additional assumptions. We state the necessary condition that the underlying system must satisfy to be amenable for learning. Second, we discuss the relationship between this condition and state-of-the-art predictive representations, and we show that there are systems that satisfy the above condition but cannot be modeled by such representations. Third, we establish sufficient conditions for average-reward policy evaluation in this setting.