Experience Replay with Likelihood-free Importance Weights

Samarth Sinha; Jiaming Song; Animesh Garg; Stefano Ermon

Proceedings of The 4th Annual Learning for Dynamics and Control Conference, PMLR 168:110-123, 2022.

Abstract

The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning methods such as actor-critic. In this work, we propose to re-weight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights. We apply the proposed approach empirically on Soft Actor Critic (SAC), Double DQN and Data-regularized Q (DrQ), over 12 Atari environments and 6 tasks from the DeepMind control suite. We achieve superior sample complexity on 9 out of 12 Atari environments and 16 out of 24 method-task combinations for DCS compared to the best baselines.
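The re-weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain logistic-regression classifier as the likelihood-free density ratio estimator, and the function names (`fit_ratio_classifier`, `importance_weights`, `weighted_td_loss`) are hypothetical. A classifier is trained to separate on-policy from off-policy samples, its odds give an estimate of the density ratio, and the normalized ratios weight the squared TD errors.

```python
import numpy as np

def fit_ratio_classifier(on_x, off_x, steps=500, lr=0.1):
    """Fit logistic regression: label 1 = on-policy, label 0 = off-policy."""
    x = np.vstack([on_x, off_x])
    x = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    y = np.concatenate([np.ones(len(on_x)), np.zeros(len(off_x))])
    params = np.zeros(x.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ params)))
        params -= lr * x.T @ (p - y) / len(y)  # gradient of mean log loss
    return params

def importance_weights(params, x, n_on, n_off):
    """Estimate p_on(x) / p_off(x) from classifier odds, normalized to mean 1."""
    logits = np.hstack([x, np.ones((len(x), 1))]) @ params
    # Odds p(on|x)/p(off|x) = exp(logit); correct for class imbalance.
    ratio = np.exp(logits) * (n_off / n_on)
    return ratio / ratio.mean()

def weighted_td_loss(q_values, td_targets, weights):
    """Mean squared TD error, re-weighted per sample."""
    return np.mean(weights * (q_values - td_targets) ** 2)

if __name__ == "__main__":
    # Toy data (assumption): 1-D "experiences" drawn from shifted Gaussians.
    rng = np.random.default_rng(0)
    on_policy = rng.normal(1.0, 1.0, size=(500, 1))
    replay = rng.normal(0.0, 1.0, size=(500, 1))
    params = fit_ratio_classifier(on_policy, replay)
    w = importance_weights(params, replay, len(on_policy), len(replay))
    print("mean weight:", w.mean())
```

In this toy setup, replay samples that look more like on-policy data receive larger weights, so their TD errors contribute more to the loss.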

Cite this Paper

BibTeX


@InProceedings{pmlr-v168-sinha22a,
  title = 	 {Experience Replay with Likelihood-free Importance Weights},
  author =       {Sinha, Samarth and Song, Jiaming and Garg, Animesh and Ermon, Stefano},
  booktitle = 	 {Proceedings of The 4th Annual Learning for Dynamics and Control Conference},
  pages = 	 {110--123},
  year = 	 {2022},
  editor = 	 {Firoozi, Roya and Mehr, Negar and Yel, Esen and Antonova, Rika and Bohg, Jeannette and Schwager, Mac and Kochenderfer, Mykel},
  volume = 	 {168},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v168/sinha22a/sinha22a.pdf},
  url = 	 {https://proceedings.mlr.press/v168/sinha22a.html},
  abstract = 	 {The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning methods such as actor-critic. In this work, we propose to re-weight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights. We apply the proposed approach empirically on Soft Actor Critic (SAC), Double DQN and Data-regularized Q (DrQ), over 12 Atari environments and 6 tasks from the DeepMind control suite. We achieve superior sample complexity on 9 out of 12 Atari environments and 16 out of 24 method-task combinations for DCS compared to the best baselines.}
}

Endnote

%0 Conference Paper
%T Experience Replay with Likelihood-free Importance Weights
%A Samarth Sinha
%A Jiaming Song
%A Animesh Garg
%A Stefano Ermon
%B Proceedings of The 4th Annual Learning for Dynamics and Control Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Roya Firoozi
%E Negar Mehr
%E Esen Yel
%E Rika Antonova
%E Jeannette Bohg
%E Mac Schwager
%E Mykel Kochenderfer
%F pmlr-v168-sinha22a
%I PMLR
%P 110--123
%U https://proceedings.mlr.press/v168/sinha22a.html
%V 168
%X The use of past experiences to accelerate temporal difference (TD) learning of value functions, or experience replay, is a key component in deep reinforcement learning methods such as actor-critic. In this work, we propose to re-weight experiences based on their likelihood under the stationary distribution of the current policy, and justify this with a contraction argument over the Bellman evaluation operator. The resulting TD objective encourages small approximation errors on the value function over frequently encountered states. To balance bias (from off-policy experiences) and variance (from on-policy experiences), we use a likelihood-free density ratio estimator between on-policy and off-policy experiences, and use the learned ratios as the prioritization weights. We apply the proposed approach empirically on Soft Actor Critic (SAC), Double DQN and Data-regularized Q (DrQ), over 12 Atari environments and 6 tasks from the DeepMind control suite. We achieve superior sample complexity on 9 out of 12 Atari environments and 16 out of 24 method-task combinations for DCS compared to the best baselines.

APA


Sinha, S., Song, J., Garg, A. & Ermon, S. (2022). Experience Replay with Likelihood-free Importance Weights. Proceedings of The 4th Annual Learning for Dynamics and Control Conference, in Proceedings of Machine Learning Research 168:110-123. Available from https://proceedings.mlr.press/v168/sinha22a.html.
