Scalable Bilinear $\pi$ Learning Using State and Action Features
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:834-843, 2018.
Abstract
Approximate linear programming (ALP) represents one of the major algorithmic families for solving large-scale Markov decision processes (MDPs). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $\pi$ learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts linear and bilinear models to represent the high-dimensional value function and state-action distributions, respectively, using given state and action features. Its runtime complexity depends on the number of features, not the size of the underlying MDP. Second, it operates in a fully online fashion without having to store any samples, and thus has a minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.
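The linear and bilinear parameterizations mentioned above can be sketched as follows. This is an illustrative minimal example, not the paper's exact construction: the feature maps, dimensions, and the softmax-style normalization of the state-action distribution are all assumptions made for the sketch.

```python
import numpy as np

# Assumed dimensions: d1 state features, d2 action features,
# over a small enumerable MDP (only for illustration; the point of the
# method is that its runtime depends on d1 and d2, not on |S| or |A|).
n_states, n_actions = 10, 5
d1, d2 = 4, 3
rng = np.random.default_rng(0)

# Hypothetical fixed feature maps phi(s) and psi(a), stored as tables here.
state_feats = rng.standard_normal((n_states, d1))    # rows are phi(s)
action_feats = rng.standard_normal((n_actions, d2))  # rows are psi(a)

# Linear model for the value function: V(s) = phi(s)^T w
w = rng.standard_normal(d1)
V = state_feats @ w  # vector of values, one per state

# Bilinear model for a state-action distribution:
# mu(s, a) proportional to exp(phi(s)^T Theta psi(a)),
# normalized to sum to 1 over all (s, a) pairs.
Theta = rng.standard_normal((d1, d2))
scores = np.exp(state_feats @ Theta @ action_feats.T)  # shape (n_states, n_actions)
mu = scores / scores.sum()

# mu is a valid joint distribution over state-action pairs:
# nonnegative and summing to one.
print(mu.shape)           # (10, 5)
print(float(mu.sum()))    # 1.0 (up to floating point)
```

Both models are parameterized only by `w` and `Theta`, whose sizes depend on the feature dimensions, which is what lets the algorithm scale independently of the size of the state and action spaces.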