Stabilizing Off-Policy Deep Reinforcement Learning from Pixels

Edoardo Cetin; Philip J Ball; Stephen Roberts; Oya Celiktutan

Proceedings of the 39th International Conference on Machine Learning, PMLR 162:2784-2810, 2022.

Abstract

Off-policy reinforcement learning (RL) from pixel observations is notoriously unstable. As a result, many successful algorithms must combine different domain-specific practices and auxiliary losses to learn meaningful behaviors in complex environments. In this work, we provide novel analysis demonstrating that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We show that this new visual deadly triad causes unstable training and premature convergence to degenerate solutions, a phenomenon we name catastrophic self-overfitting. Based on our analysis, we propose A-LIX, a method providing adaptive regularization to the encoder’s gradients that explicitly prevents the occurrence of catastrophic self-overfitting using a dual objective. By applying A-LIX, we significantly outperform the prior state-of-the-art on the DeepMind Control and Atari benchmarks without any data augmentation or auxiliary losses.

Cite this Paper

BibTeX


@InProceedings{pmlr-v162-cetin22a,
  title = 	 {Stabilizing Off-Policy Deep Reinforcement Learning from Pixels},
  author =       {Cetin, Edoardo and Ball, Philip J and Roberts, Stephen and Celiktutan, Oya},
  booktitle = 	 {Proceedings of the 39th International Conference on Machine Learning},
  pages = 	 {2784--2810},
  year = 	 {2022},
  editor = 	 {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume = 	 {162},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--23 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v162/cetin22a/cetin22a.pdf},
  url = 	 {https://proceedings.mlr.press/v162/cetin22a.html},
  abstract = 	 {Off-policy reinforcement learning (RL) from pixel observations is notoriously unstable. As a result, many successful algorithms must combine different domain-specific practices and auxiliary losses to learn meaningful behaviors in complex environments. In this work, we provide novel analysis demonstrating that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We show that this new visual deadly triad causes unstable training and premature convergence to degenerate solutions, a phenomenon we name catastrophic self-overfitting. Based on our analysis, we propose A-LIX, a method providing adaptive regularization to the encoder’s gradients that explicitly prevents the occurrence of catastrophic self-overfitting using a dual objective. By applying A-LIX, we significantly outperform the prior state-of-the-art on the DeepMind Control and Atari benchmarks without any data augmentation or auxiliary losses.}
}

Endnote

%0 Conference Paper
%T Stabilizing Off-Policy Deep Reinforcement Learning from Pixels
%A Edoardo Cetin
%A Philip J Ball
%A Stephen Roberts
%A Oya Celiktutan
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-cetin22a
%I PMLR
%P 2784--2810
%U https://proceedings.mlr.press/v162/cetin22a.html
%V 162
%X Off-policy reinforcement learning (RL) from pixel observations is notoriously unstable. As a result, many successful algorithms must combine different domain-specific practices and auxiliary losses to learn meaningful behaviors in complex environments. In this work, we provide novel analysis demonstrating that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We show that this new visual deadly triad causes unstable training and premature convergence to degenerate solutions, a phenomenon we name catastrophic self-overfitting. Based on our analysis, we propose A-LIX, a method providing adaptive regularization to the encoder’s gradients that explicitly prevents the occurrence of catastrophic self-overfitting using a dual objective. By applying A-LIX, we significantly outperform the prior state-of-the-art on the DeepMind Control and Atari benchmarks without any data augmentation or auxiliary losses.

APA


Cetin, E., Ball, P.J., Roberts, S. & Celiktutan, O. (2022). Stabilizing Off-Policy Deep Reinforcement Learning from Pixels. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:2784-2810. Available from https://proceedings.mlr.press/v162/cetin22a.html.
