Adversarially Trained Actor Critic for Offline Reinforcement Learning

Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3852-3878, 2022.

Abstract

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC frames offline RL as a two-player Stackelberg game: a policy actor competes against an adversarially trained value critic, which finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by the data when the hyperparameters are chosen appropriately. Notably, compared with existing works, our framework offers both theoretical guarantees for general function approximation and a deep RL implementation that scales to complex environments and large datasets. On the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
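To make the game structure concrete, the relative-pessimism objective described above can be written in the following form, consistent with the abstract's description (here $\pi$ is the actor, $f \in \mathcal{F}$ the critic, $\mu$ the offline data distribution induced by the behavior policy, $\beta \geq 0$ the hyperparameter controlling the degree of pessimism, and $\mathcal{E}_\mu(f)$ a Bellman-consistency error term):

$$\hat{\pi} \in \arg\max_{\pi} \; \min_{f \in \mathcal{F}} \; \mathbb{E}_{(s,a) \sim \mu}\!\left[ f(s, \pi(s)) - f(s, a) \right] + \beta \, \mathcal{E}_\mu(f).$$

The inner minimization is the adversarial critic: among value functions consistent with the data (small $\mathcal{E}_\mu$), it selects the one under which the actor looks worst relative to the behavior policy. The outer maximization trains the actor against this pessimistic evaluation; since the behavior policy itself attains objective value near zero, any actor that does better is certified to improve on the behavior policy, which is the improvement guarantee stated in the abstract.

A minimal single-step sketch of how the two players could be trained by alternating gradient updates follows. This is a hypothetical PyTorch-style illustration of the objective above, not the authors' released implementation; the function name, batch layout, and update scheme are assumptions.

    import torch

    def atac_step(actor, critic, batch, beta, actor_opt, critic_opt, gamma=0.99):
        # One offline minibatch (s, a, r, s') drawn from the behavior data.
        s, a, r, s_next = batch

        with torch.no_grad():
            pi_s = actor(s)  # actor's action, held fixed during the critic update
            td_target = r + gamma * critic(s_next, actor(s_next))

        # Critic (inner min): relative-pessimism term plus a beta-weighted
        # Bellman-consistency term, minimized over the critic.
        pessimism = (critic(s, pi_s) - critic(s, a)).mean()
        bellman = ((critic(s, a) - td_target) ** 2).mean()
        critic_loss = pessimism + beta * bellman
        critic_opt.zero_grad()
        critic_loss.backward()
        critic_opt.step()

        # Actor (outer max): ascend the pessimistic value of its own actions.
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

Larger $\beta$ ties the critic more tightly to Bellman consistency (less pessimism); $\beta = 0$ lets the critic be maximally adversarial subject only to the function class.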

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-cheng22b,
  title     = {Adversarially Trained Actor Critic for Offline Reinforcement Learning},
  author    = {Cheng, Ching-An and Xie, Tengyang and Jiang, Nan and Agarwal, Alekh},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {3852--3878},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/cheng22b/cheng22b.pdf},
  url       = {https://proceedings.mlr.press/v162/cheng22b.html}
}
APA
Cheng, C., Xie, T., Jiang, N., & Agarwal, A. (2022). Adversarially Trained Actor Critic for Offline Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3852-3878. Available from https://proceedings.mlr.press/v162/cheng22b.html.