Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling

Che Wang, Yanqiu Wu, Quan Vuong, Keith Ross
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:10070-10080, 2020.

Abstract

We aim to develop off-policy DRL algorithms that not only exceed state-of-the-art performance but are also simple and minimalistic. For standard continuous control benchmarks, Soft Actor-Critic (SAC), which employs entropy maximization, currently provides state-of-the-art performance. We first demonstrate that the entropy term in SAC addresses action saturation due to the bounded nature of the action spaces. With this insight, we propose a streamlined algorithm with a simple normalization scheme or with inverted gradients. We show that both approaches can match SAC’s sample efficiency without the need for entropy maximization. We then propose a simple non-uniform sampling method for selecting transitions from the replay buffer during training. Extensive experimental results demonstrate that our proposed sampling scheme leads to state-of-the-art sample efficiency on challenging continuous control tasks. We combine all of our findings into one simple algorithm, which we call Streamlined Off-Policy with Emphasizing Recent Experience, for which we provide robust public-domain code.
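
The sketch below gives a rough, hedged illustration of the two ideas summarized in the abstract: normalizing the policy network's pre-tanh outputs to avoid action saturation, and an "Emphasizing Recent Experience" style non-uniform sampling range. The specific normalization rule and the constants (eta, c_min) are assumptions chosen for illustration and may differ from the authors' released implementation.

```python
import numpy as np

def normalize_pre_tanh_action(mu):
    """Output-normalization sketch (assumed form): if the average magnitude of
    the pre-tanh policy outputs exceeds 1, rescale them before squashing with
    tanh, so the squashed actions are less likely to saturate at the bounds."""
    G = np.mean(np.abs(mu))
    if G > 1.0:
        mu = mu / G
    return np.tanh(mu)

def ere_sampling_ranges(buffer_size, num_updates, eta=0.996, c_min=5000):
    """Non-uniform sampling sketch (assumed constants): for the k-th of K
    updates performed after an episode, sample uniformly from only the most
    recent c_k transitions, with c_k shrinking geometrically so later updates
    emphasize more recent experience."""
    N = buffer_size
    ranges = []
    for k in range(1, num_updates + 1):
        c_k = max(int(N * eta ** (k * 1000.0 / num_updates)), c_min)
        ranges.append(min(c_k, N))
    return ranges
```

For example, with a buffer of one million transitions and 1000 updates, the first update would sample from nearly the whole buffer while the last would sample from only the most recent few thousand transitions, under the assumed constants above.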

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-wang20x,
  title     = {Striving for Simplicity and Performance in Off-Policy {DRL}: Output Normalization and Non-Uniform Sampling},
  author    = {Wang, Che and Wu, Yanqiu and Vuong, Quan and Ross, Keith},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {10070--10080},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/wang20x/wang20x.pdf},
  url       = {https://proceedings.mlr.press/v119/wang20x.html}
}
Endnote
%0 Conference Paper
%T Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling
%A Che Wang
%A Yanqiu Wu
%A Quan Vuong
%A Keith Ross
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-wang20x
%I PMLR
%P 10070--10080
%U https://proceedings.mlr.press/v119/wang20x.html
%V 119
APA
Wang, C., Wu, Y., Vuong, Q. & Ross, K. (2020). Striving for Simplicity and Performance in Off-Policy DRL: Output Normalization and Non-Uniform Sampling. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:10070-10080. Available from https://proceedings.mlr.press/v119/wang20x.html.