Off-Policy Actor-Critic with Shared Experience Replay

Simon Schmitt; Matteo Hessel; Karen Simonyan

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:8545-8554, 2020.

Abstract

We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay and (b) the stability of off-policy learning where agents learn from other agents' behaviour. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained up until 200M environment frames per game; and (2) a population of agents is trained up until 200M environment frames each and may share experience. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.

Cite this Paper

BibTeX


@InProceedings{pmlr-v119-schmitt20a,
  title = 	 {Off-Policy Actor-Critic with Shared Experience Replay},
  author =       {Schmitt, Simon and Hessel, Matteo and Simonyan, Karen},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {8545--8554},
  year = 	 {2020},
  editor = 	 {Daumé, III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/schmitt20a/schmitt20a.pdf},
  url = 	 {https://proceedings.mlr.press/v119/schmitt20a.html},
  abstract = 	 {We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay and (b) the stability of off-policy learning where agents learn from other agents' behaviour. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained up until 200M environment frames per game; and (2) a population of agents is trained up until 200M environment frames each and may share experience. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.}
}

Endnote

%0 Conference Paper
%T Off-Policy Actor-Critic with Shared Experience Replay
%A Simon Schmitt
%A Matteo Hessel
%A Karen Simonyan
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-schmitt20a
%I PMLR
%P 8545--8554
%U https://proceedings.mlr.press/v119/schmitt20a.html
%V 119
%X We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay and (b) the stability of off-policy learning where agents learn from other agents' behaviour. To this end, we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained up until 200M environment frames per game; and (2) a population of agents is trained up until 200M environment frames each and may share experience. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.

APA


Schmitt, S., Hessel, M., & Simonyan, K. (2020). Off-Policy Actor-Critic with Shared Experience Replay. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:8545-8554. Available from https://proceedings.mlr.press/v119/schmitt20a.html.
