Experience Replay with Likelihood-free Importance Weights
Proceedings of The 4th Annual Learning for Dynamics and Control Conference, PMLR 168:110-123, 2022.
The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning methods such as actor-critic.In this work, we propose to re-weight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights. We apply the proposed approach empirically on Soft Actor Critic (SAC), Double DQN and Data-regularized Q(DrQ), over 12 Atari environments and 6 tasks from the DeepMind control suite. We achieve superior sample complexity on 9 out of 12 Atari environments and 16 out of 24 method-task combinations for DCS compared to the best baselines.