Off-Policy Risk Assessment for Markov Decision Processes
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:5022-5050, 2022.
Addressing such diverse ends as mitigating safety risks, aligning agent behavior with human preferences, and improving the efficiency of learning, an emerging line of reinforcement learning research addresses the entire distribution of returns and various risk functionals that depend upon it. In the contextual bandit setting, recently work on off-policy risk assessment estimates the target policy’s CDF of returns, providing finite sample guarantees that extend to (and hold simultaneously over) plugin estimates of an arbitrarily large set of risk functionals. In this paper, we lift OPRA to Markov decision processes (MDPs), where importance sampling (IS) CDF estimators suffer high variance on longer trajectories due to vanishing (and exploding) importance weights. To mitigate these problems, we incorporate model-based estimation to develop the first doubly robust (DR) estimator for the CDF of returns in MDPs. The DR estimator enjoys significantly less variance and, when the model is well specified, achieves the Cramer-Rao variance lower bound. Moreover, for many risk functionals, the downstream estimates enjoy both lower bias and lower variance. Additionally, we derive the first minimax lower bounds for off-policy CDF and risk estimation, which match our error bounds up to a constant. Finally, we demonstrate the efficacy of our DR CDF estimates experimentally on several different environments.