Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

Adam R Villaflor, Zhe Huang, Swapnil Pande, John M Dolan, Jeff Schneider
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:22270-22283, 2022.

Abstract

Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method’s superior performance on a variety of autonomous driving tasks in simulation.
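The core idea the abstract describes is keeping the action (policy) model and the state-transition (world) model separate, so that at test time candidate action sequences can be scored against several sampled futures and selected pessimistically rather than optimistically. The sketch below illustrates only that planning pattern; it is not the authors' Transformer-based implementation, and the dynamics, reward, and sampling routines are hypothetical stand-ins.

```python
# Minimal sketch (assumptions, not the paper's code): score candidate action
# sequences against several sampled futures from a separate world model and
# pick the plan with the best worst-case return, instead of assuming the
# single most favorable future.

import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, HORIZON = 4, 2, 5
NUM_CANDIDATES, NUM_FUTURES = 16, 8  # candidate plans x sampled futures

# Hypothetical stochastic world model: next state depends on state, action,
# and a noise draw that stands in for one possible future of the environment.
A = rng.normal(size=(STATE_DIM, STATE_DIM)) * 0.1
B = rng.normal(size=(STATE_DIM, ACTION_DIM)) * 0.1

def world_model_step(state, action, noise):
    return A @ state + B @ action + 0.05 * noise

def reward(state, action):
    # Hypothetical reward: stay near the origin with small control effort.
    return -np.sum(state ** 2) - 0.1 * np.sum(action ** 2)

def rollout_return(state, actions, noise_seq):
    # Return accumulated along one sampled future of the world model.
    total = 0.0
    for action, noise in zip(actions, noise_seq):
        total += reward(state, action)
        state = world_model_step(state, action, noise)
    return total

def robust_plan(state):
    """Choose the candidate action sequence whose *worst-case* return across
    the sampled futures is highest (pessimistic rather than optimistic)."""
    candidates = rng.normal(size=(NUM_CANDIDATES, HORIZON, ACTION_DIM))
    futures = rng.normal(size=(NUM_FUTURES, HORIZON, STATE_DIM))
    worst_case = np.array([
        min(rollout_return(state, plan, future) for future in futures)
        for plan in candidates
    ])
    return candidates[np.argmax(worst_case)]

if __name__ == "__main__":
    s0 = rng.normal(size=STATE_DIM)
    plan = robust_plan(s0)
    print("first robust action:", plan[0])
```

The design choice this sketch highlights is the worst-case aggregation over sampled futures: because the world model and the candidate action sequences are generated independently, a plan cannot implicitly "choose" a lucky transition, which is the optimism bias the abstract attributes to jointly modeled state-action sequences.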

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-villaflor22a,
  title     = {Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning},
  author    = {Villaflor, Adam R and Huang, Zhe and Pande, Swapnil and Dolan, John M and Schneider, Jeff},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {22270--22283},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/villaflor22a/villaflor22a.pdf},
  url       = {https://proceedings.mlr.press/v162/villaflor22a.html},
  abstract  = {Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method’s superior performance on a variety of autonomous driving tasks in simulation.}
}
Endnote
%0 Conference Paper
%T Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
%A Adam R Villaflor
%A Zhe Huang
%A Swapnil Pande
%A John M Dolan
%A Jeff Schneider
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-villaflor22a
%I PMLR
%P 22270--22283
%U https://proceedings.mlr.press/v162/villaflor22a.html
%V 162
%X Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We demonstrate our method’s superior performance on a variety of autonomous driving tasks in simulation.
APA
Villaflor, A.R., Huang, Z., Pande, S., Dolan, J.M. & Schneider, J. (2022). Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:22270-22283. Available from https://proceedings.mlr.press/v162/villaflor22a.html.
