A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes

Chengchun Shi, Masatoshi Uehara, Jiawei Huang, Nan Jiang
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:20057-20094, 2022.

Abstract

We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail. A Python implementation of our proposal is available at https://github.com/jiaweihhuang/Confounded-POMDP-Exp.
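To make the estimation principle concrete, below is a minimal numerical sketch of the generic minimax recipe for learning a function from a conditional moment restriction, using linear function classes for both the learned function and the adversarial test function. The feature maps, the simplified moment condition E[y - b(Z) | X] = 0, and the synthetic data are illustrative assumptions only; the paper’s actual bridge equations additionally condition on past observations and actions to handle confounding, and the repository above contains the authors’ implementation.

# Sketch of minimax learning under a conditional moment restriction
#     E[ y - b(Z) | X ] = 0,
# with linear function classes for the learned function b and the adversarial
# test function f. This illustrates the estimation principle only; it is NOT
# the paper's bridge-function estimator.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: X is the conditioning variable, Z the input of b.
n = 5000
x = rng.normal(size=n)                # conditioning variable
z = x + 0.5 * rng.normal(size=n)      # input of b, correlated with x
y = 2.0 * z + rng.normal(size=n)      # outcome; the true solution is b(z) = 2 z

def phi(z):
    # Features spanning the class of candidate functions b (linear in z).
    return np.stack([np.ones_like(z), z], axis=1)

def psi(x):
    # Features spanning the adversarial test-function class.
    return np.stack([np.ones_like(x), x], axis=1)

Phi, Psi = phi(z), psi(x)

# Minimax objective with linear classes:
#   min_beta max_alpha (1/n) sum_i alpha'psi(x_i) * (y_i - beta'phi(z_i)) - ||alpha||^2 / 2.
# The inner maximization over alpha is quadratic and has a closed form, so the
# outer problem reduces to least squares in beta.
A = Psi.T @ Phi / n                          # empirical cross-moment matrix E_n[psi(x) phi(z)']
c = Psi.T @ y / n                            # empirical moment vector       E_n[psi(x) y]
beta = np.linalg.lstsq(A, c, rcond=None)[0]  # minimizer of ||c - A beta||^2

print("estimated coefficients:", beta)       # expect roughly [0, 2]

With linear classes the saddle point collapses to the least-squares problem solved above; with flexible classes (neural networks or kernels) the same objective is typically optimized by alternating or kernelized updates.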

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-shi22f,
  title     = {A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable {M}arkov Decision Processes},
  author    = {Shi, Chengchun and Uehara, Masatoshi and Huang, Jiawei and Jiang, Nan},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {20057--20094},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/shi22f/shi22f.pdf},
  url       = {https://proceedings.mlr.press/v162/shi22f.html},
  abstract  = {We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail. A Python implementation of our proposal is available at https://github.com/jiaweihhuang/Confounded-POMDP-Exp.}
}
Endnote
%0 Conference Paper
%T A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes
%A Chengchun Shi
%A Masatoshi Uehara
%A Jiawei Huang
%A Nan Jiang
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-shi22f
%I PMLR
%P 20057--20094
%U https://proceedings.mlr.press/v162/shi22f.html
%V 162
%X We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs), where the evaluation policy depends only on observable variables and the behavior policy depends on unobservable latent variables. Existing works either assume no unmeasured confounders, or focus on settings where both the observation and the state spaces are tabular. In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy’s value and the observed data distribution. We next propose minimax estimation methods for learning these bridge functions, and construct three estimators based on these estimated bridge functions, corresponding to a value function-based estimator, a marginalized importance sampling estimator, and a doubly-robust estimator. Our proposal permits general function approximation and is thus applicable to settings with continuous or large observation/state spaces. The nonasymptotic and asymptotic properties of the proposed estimators are investigated in detail. A Python implementation of our proposal is available at https://github.com/jiaweihhuang/Confounded-POMDP-Exp.
APA
Shi, C., Uehara, M., Huang, J., & Jiang, N. (2022). A Minimax Learning Approach to Off-Policy Evaluation in Confounded Partially Observable Markov Decision Processes. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:20057-20094. Available from https://proceedings.mlr.press/v162/shi22f.html.
