Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes

Hongyi Guo, Qi Cai, Yufeng Zhang, Zhuoran Yang, Zhaoran Wang
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:8016-8038, 2022.

Abstract

We study offline reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) with possibly infinite state and observation spaces. Under the undercompleteness assumption, the optimal policy in such POMDPs is characterized by a class of finite-memory Bellman operators. In the offline setting, estimating these operators directly is challenging due to (i) the large observation space and (ii) insufficient coverage of the offline dataset. To tackle these challenges, we propose a novel algorithm that constructs confidence regions for these Bellman operators via offline estimation of their RKHS embeddings, and returns the final policy via pessimistic planning within the confidence regions. We prove that the proposed algorithm attains an \(\epsilon\)-optimal policy using an offline dataset containing \(\tilde{\mathcal{O}}(1 / \epsilon^2)\) episodes, provided that the behavior policy has good coverage over the optimal trajectory. To the best of our knowledge, our algorithm is the first provably sample-efficient offline algorithm for POMDPs without uniform coverage assumptions.
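To make the pessimism principle described above concrete, below is a minimal, hypothetical Python sketch of planning over a confidence region: each policy is scored by its worst-case value across all models consistent with the offline data, and the policy with the best worst-case score is returned. The names candidate_operators, evaluate_policy, and policies are illustrative assumptions only; the paper's actual algorithm builds confidence regions for finite-memory Bellman operators via RKHS embeddings estimated from offline data, which is abstracted away here.

# Hypothetical sketch of pessimistic planning over a confidence region.
# NOT the paper's algorithm: the actual method constructs confidence regions
# for finite-memory Bellman operators via offline RKHS-embedding estimation;
# that estimation step is abstracted into `candidate_operators` below.

def pessimistic_planning(policies, candidate_operators, evaluate_policy):
    """Return the policy maximizing its worst-case value over the confidence region.

    policies            : iterable of candidate policies
    candidate_operators : models (e.g., Bellman operators) lying in the confidence region
    evaluate_policy     : callable (policy, model) -> estimated value of the policy
                          under that model
    """
    best_policy, best_value = None, float("-inf")
    for pi in policies:
        # Pessimism: score each policy by its lowest value over all models
        # that remain statistically consistent with the offline dataset.
        pessimistic_value = min(evaluate_policy(pi, op) for op in candidate_operators)
        if pessimistic_value > best_value:
            best_policy, best_value = pi, pessimistic_value
    return best_policy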

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-guo22a,
  title     = {Provably Efficient Offline Reinforcement Learning for Partially Observable {M}arkov Decision Processes},
  author    = {Guo, Hongyi and Cai, Qi and Zhang, Yufeng and Yang, Zhuoran and Wang, Zhaoran},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {8016--8038},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/guo22a/guo22a.pdf},
  url       = {https://proceedings.mlr.press/v162/guo22a.html},
  abstract  = {We study offline reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) with possibly infinite state and observation spaces. Under the undercompleteness assumption, the optimal policy in such POMDPs is characterized by a class of finite-memory Bellman operators. In the offline setting, estimating these operators directly is challenging due to (i) the large observation space and (ii) insufficient coverage of the offline dataset. To tackle these challenges, we propose a novel algorithm that constructs confidence regions for these Bellman operators via offline estimation of their RKHS embeddings, and returns the final policy via pessimistic planning within the confidence regions. We prove that the proposed algorithm attains an \(\epsilon\)-optimal policy using an offline dataset containing \(\tilde{\mathcal{O}}(1 / \epsilon^2)\) episodes, provided that the behavior policy has good coverage over the optimal trajectory. To the best of our knowledge, our algorithm is the first provably sample-efficient offline algorithm for POMDPs without uniform coverage assumptions.}
}
Endnote
%0 Conference Paper
%T Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes
%A Hongyi Guo
%A Qi Cai
%A Yufeng Zhang
%A Zhuoran Yang
%A Zhaoran Wang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-guo22a
%I PMLR
%P 8016--8038
%U https://proceedings.mlr.press/v162/guo22a.html
%V 162
%X We study offline reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) with possibly infinite state and observation spaces. Under the undercompleteness assumption, the optimal policy in such POMDPs is characterized by a class of finite-memory Bellman operators. In the offline setting, estimating these operators directly is challenging due to (i) the large observation space and (ii) insufficient coverage of the offline dataset. To tackle these challenges, we propose a novel algorithm that constructs confidence regions for these Bellman operators via offline estimation of their RKHS embeddings, and returns the final policy via pessimistic planning within the confidence regions. We prove that the proposed algorithm attains an \(\epsilon\)-optimal policy using an offline dataset containing \(\tilde{\mathcal{O}}(1 / \epsilon^2)\) episodes, provided that the behavior policy has good coverage over the optimal trajectory. To the best of our knowledge, our algorithm is the first provably sample-efficient offline algorithm for POMDPs without uniform coverage assumptions.
APA
Guo, H., Cai, Q., Zhang, Y., Yang, Z. & Wang, Z. (2022). Provably Efficient Offline Reinforcement Learning for Partially Observable Markov Decision Processes. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:8016-8038. Available from https://proceedings.mlr.press/v162/guo22a.html.
