Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Yuejie Chi
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:19967-20025, 2022.

Abstract

Offline or batch reinforcement learning seeks to learn a near-optimal policy from historical data, without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has recently been introduced to mitigate the high bias of estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts, which do not require explicit model estimation, have not been adequately studied, especially in terms of sample efficiency. To address this gap, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption, which does not require full coverage of the state-action space. In addition, we propose a variance-reduced pessimistic Q-learning algorithm that achieves near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when combined with pessimism and variance reduction.
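
To make the pessimism principle concrete, below is a minimal sketch of lower-confidence-bound Q-learning on an offline dataset for a finite-horizon MDP: each Q-learning target has a count-based penalty subtracted so that poorly covered state-action pairs are valued conservatively. The toy MDP, the uniform behavior policy, the bonus constant c_b, and the exact form of the penalty and learning rate are illustrative assumptions for this sketch, not the paper's precise specification (which also includes a variance-reduced variant).

```python
# A minimal sketch of pessimistic (LCB) Q-learning on an offline dataset of
# finite-horizon transitions. Toy MDP, behavior policy, and bonus constant
# are illustrative assumptions.
import numpy as np

H, S, A = 3, 4, 2                      # horizon, number of states, number of actions
rng = np.random.default_rng(0)

def sample_dataset(num_episodes=500):
    """Hypothetical offline dataset: trajectories of (h, s, a, r, s_next)
    tuples collected by a uniformly random behavior policy on a toy MDP."""
    dataset = []
    for _ in range(num_episodes):
        s = int(rng.integers(S))
        traj = []
        for h in range(H):
            a = int(rng.integers(A))                 # behavior policy: uniform
            r = float(rng.random()) * float(s == a)  # toy reward in [0, 1]
            s_next = int(rng.integers(S))            # toy transition dynamics
            traj.append((h, s, a, r, s_next))
            s = s_next
        dataset.append(traj)
    return dataset

def lcb_q_learning(dataset, c_b=1.0):
    """Q-learning with a lower-confidence-bound penalty subtracted from each
    target (the constant c_b and the log factor are illustrative choices)."""
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))                         # V[H] = 0 at the terminal step
    N = np.zeros((H, S, A))                          # per-(step, state, action) sample counts
    for traj in dataset:
        for (h, s, a, r, s_next) in traj:
            N[h, s, a] += 1
            n = N[h, s, a]
            eta = (H + 1) / (H + n)                  # rescaled linear learning rate
            bonus = c_b * np.sqrt(H**3 * np.log(len(dataset)) / n)
            target = r + V[h + 1, s_next] - bonus    # pessimistic target
            Q[h, s, a] = (1 - eta) * Q[h, s, a] + eta * target
            # Keep the value estimate within its valid range [0, H - h].
            V[h, s] = np.clip(Q[h, s].max(), 0.0, H - h)
    return Q, Q.argmax(axis=2)                       # greedy policy w.r.t. pessimistic Q

Q_hat, policy = lcb_q_learning(sample_dataset())
print(policy.shape)                                  # (H, S): one action per (step, state)
```

The design intent of the penalty is that rarely visited state-action pairs receive a large deduction, so the learned values stay on the conservative side and the greedy policy avoids actions whose apparent value rests on only a handful of samples.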

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-shi22c,
  title     = {Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity},
  author    = {Shi, Laixi and Li, Gen and Wei, Yuting and Chen, Yuxin and Chi, Yuejie},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {19967--20025},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/shi22c/shi22c.pdf},
  url       = {https://proceedings.mlr.press/v162/shi22c.html}
}
Endnote
%0 Conference Paper
%T Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity
%A Laixi Shi
%A Gen Li
%A Yuting Wei
%A Yuxin Chen
%A Yuejie Chi
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-shi22c
%I PMLR
%P 19967--20025
%U https://proceedings.mlr.press/v162/shi22c.html
%V 162
APA
Shi, L., Li, G., Wei, Y., Chen, Y., & Chi, Y. (2022). Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:19967-20025. Available from https://proceedings.mlr.press/v162/shi22c.html.
