Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:2852-2862, 2020.
Learning and planning in partially observable domains are among the most difficult problems in reinforcement learning. Traditional methods treat these two problems as independent, resulting in a classic two-stage paradigm: first learn the environment dynamics, then compute the optimal policy accordingly. This approach, however, disconnects the reward information from the learning of the environment model and can consequently lead to representations that are sample-inefficient and time-consuming for planning. In this paper, we propose a novel algorithm that incorporates reward information into the representations of the environment to unify these two stages. Our algorithm is closely related to the spectral learning algorithm for predictive state representations and offers appealing theoretical guarantees and time complexity. We show empirically on two domains that our approach is more sample- and time-efficient than classical methods.