Offline Reinforcement Learning with Fisher Divergence Critic Regularization

Ilya Kostrikov, Rob Fergus, Jonathan Tompson, Ofir Nachum
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5774-5783, 2021.

Abstract

Many modern approaches to offline Reinforcement Learning (RL) utilize behavior regularization, typically augmenting a model-free actor-critic algorithm with a penalty measuring divergence of the policy from the offline data. In this work, we propose an alternative approach to encouraging the learned policy to stay close to the data, namely parameterizing the critic as the log-behavior-policy, which generated the offline data, plus a state-action value offset term, which can be learned using a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature. We thus term our resulting algorithm Fisher-BRC (Behavior Regularized Critic). On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.
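As a rough illustration of the idea described in the abstract, the sketch below parameterizes the critic as Q(s, a) = log mu(a|s) + O(s, a), where mu is a behavior-policy density model fit to the offline data and O is a learned offset network, and penalizes the squared action-gradient of the offset, the term the abstract relates to Fisher divergence regularization. The use of PyTorch, the network sizes, and the names OffsetCritic and offset_gradient_penalty are illustrative assumptions, not the authors' reference implementation.

# Minimal sketch of the critic parameterization and gradient penalty
# (assumed PyTorch; not the paper's reference code).
import torch
import torch.nn as nn


class OffsetCritic(nn.Module):
    """Critic parameterized as Q(s, a) = log mu(a|s) + O_theta(s, a)."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.offset = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action, log_behavior_prob):
        # log_behavior_prob = log mu(a|s), e.g. from a density model
        # (behavior cloning) trained separately on the offline data.
        o = self.offset(torch.cat([state, action], dim=-1)).squeeze(-1)
        return log_behavior_prob + o, o


def offset_gradient_penalty(critic, state, policy_action):
    # ||grad_a O_theta(s, a)||^2 evaluated at actions from the learned policy:
    # the gradient penalty on the offset term described in the abstract.
    # The action is detached so the penalty regularizes only the critic.
    action = policy_action.detach().clone().requires_grad_(True)
    o = critic.offset(torch.cat([state, action], dim=-1)).squeeze(-1)
    grad = torch.autograd.grad(o.sum(), action, create_graph=True)[0]
    return grad.pow(2).sum(dim=-1).mean()

In a full training loop, this penalty would presumably be added, scaled by a regularization coefficient, to a standard TD loss on Q, with the actor trained to maximize Q as in the usual behavior-regularized actor-critic recipe.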

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-kostrikov21a,
  title     = {Offline Reinforcement Learning with Fisher Divergence Critic Regularization},
  author    = {Kostrikov, Ilya and Fergus, Rob and Tompson, Jonathan and Nachum, Ofir},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5774--5783},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/kostrikov21a/kostrikov21a.pdf},
  url       = {https://proceedings.mlr.press/v139/kostrikov21a.html}
}
Endnote
%0 Conference Paper
%T Offline Reinforcement Learning with Fisher Divergence Critic Regularization
%A Ilya Kostrikov
%A Rob Fergus
%A Jonathan Tompson
%A Ofir Nachum
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-kostrikov21a
%I PMLR
%P 5774--5783
%U https://proceedings.mlr.press/v139/kostrikov21a.html
%V 139
APA
Kostrikov, I., Fergus, R., Tompson, J. & Nachum, O. (2021). Offline Reinforcement Learning with Fisher Divergence Critic Regularization. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5774-5783. Available from https://proceedings.mlr.press/v139/kostrikov21a.html.
