Model-based Reinforcement Learning for Confounded POMDPs

Mao Hong, Zhengling Qi, Yanxun Xu
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:18668-18710, 2024.

Abstract

We propose a model-based offline reinforcement learning (RL) algorithm for confounded partially observable Markov decision processes (POMDPs) under general function approximation, and show that it is provably efficient under technical conditions such as a partial coverage assumption on the offline data distribution. Specifically, we first establish a novel model-based identification result for learning the effect of any action on the reward and future transitions in the confounded POMDP. Using this identification result, we design a nonparametric two-stage estimation procedure to construct an estimator for off-policy evaluation (OPE) that permits general function approximation. Finally, we learn the optimal policy by performing conservative policy optimization within confidence regions built from the proposed OPE estimation procedure. Under mild conditions, we establish a finite-sample upper bound on the suboptimality of the learned policy relative to the optimal one, which depends polynomially on the sample size and the horizon length.
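
The abstract describes the algorithm only at a high level: fit candidate models to the offline data, form a confidence region, and optimize a policy conservatively (pessimistically) within it. The Python sketch below illustrates that generic pessimistic-optimization pattern under stated assumptions; the CandidateModel interface, build_confidence_region, the radius parameter, and the enumeration over candidate policies are hypothetical placeholders, not the paper's two-stage estimator or its optimization routine.

# Illustrative sketch only, not the paper's actual algorithm or estimator.
# CandidateModel, its methods, and the confidence-region construction below
# are hypothetical stand-ins for the high-level recipe in the abstract:
# fit candidate models to offline data, keep those consistent with the data,
# then pick the policy with the best worst-case (pessimistic) value.
from dataclasses import dataclass
from typing import Callable, List, Sequence
import numpy as np


@dataclass
class CandidateModel:
    # Hypothetical interface: a loss measuring fit to the offline data,
    # and an OPE routine returning an estimated value for a policy.
    fit_loss: Callable[[Sequence], float]
    evaluate_policy: Callable[[object], float]


def build_confidence_region(models: List[CandidateModel],
                            offline_data: Sequence,
                            radius: float) -> List[CandidateModel]:
    """Keep every model whose offline fit loss is within `radius` of the
    best-fitting model (a stand-in for the paper's OPE-based construction)."""
    losses = np.array([m.fit_loss(offline_data) for m in models])
    return [m for m, loss in zip(models, losses) if loss <= losses.min() + radius]


def conservative_policy_optimization(policies: Sequence,
                                     models: List[CandidateModel],
                                     offline_data: Sequence,
                                     radius: float):
    """Return the policy maximizing its worst-case estimated value over the
    confidence region (pessimism under model uncertainty)."""
    region = build_confidence_region(models, offline_data, radius)
    best_policy, best_value = None, -np.inf
    for policy in policies:
        pessimistic_value = min(m.evaluate_policy(policy) for m in region)
        if pessimistic_value > best_value:
            best_policy, best_value = policy, pessimistic_value
    return best_policy

Taking the minimum over the confidence region is what makes the procedure conservative: a policy is only credited with the value it attains under every model that remains consistent with the offline data.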

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-hong24d,
  title     = {Model-based Reinforcement Learning for Confounded {POMDP}s},
  author    = {Hong, Mao and Qi, Zhengling and Xu, Yanxun},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {18668--18710},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/hong24d/hong24d.pdf},
  url       = {https://proceedings.mlr.press/v235/hong24d.html}
}
Endnote
%0 Conference Paper
%T Model-based Reinforcement Learning for Confounded POMDPs
%A Mao Hong
%A Zhengling Qi
%A Yanxun Xu
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-hong24d
%I PMLR
%P 18668--18710
%U https://proceedings.mlr.press/v235/hong24d.html
%V 235
APA
Hong, M., Qi, Z. & Xu, Y. (2024). Model-based Reinforcement Learning for Confounded POMDPs. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:18668-18710. Available from https://proceedings.mlr.press/v235/hong24d.html.
