On the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy Evaluation

Alberto Maria Metelli, Samuele Meta, Marcello Restelli
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:1423-1433, 2023.

Abstract

Off-policy methods are the basis of a large number of effective Policy Optimization (PO) algorithms. In this setting, Importance Sampling (IS) is typically employed for off-policy evaluation, with the goal of estimating the performance of a target policy given samples collected with a different behavioral policy. However, in Monte Carlo simulation, IS serves as a variance minimization technique: a suitably chosen behavioral distribution is used for sampling, reducing the variance of the estimator below the one achievable when sampling from the target distribution. In this paper, we analyze IS in these two guises in the context of PO. We provide a novel view of off-policy PO, showing a connection between the policy improvement and variance minimization objectives. Then, we illustrate how minimizing the off-policy variance can, in some circumstances, lead to a policy improvement, with the advantage, compared with direct off-policy learning, of implicitly enforcing a trust region. Finally, we present numerical simulations on continuous RL benchmarks, with a particular focus on the robustness to small batch sizes.
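
To make the two roles of IS mentioned in the abstract concrete, here is a minimal sketch (not taken from the paper; the reward f, the helper gauss_pdf, and all Gaussian parameters are illustrative choices). It estimates the expected reward of a Gaussian "target policy" in two ways: off-policy evaluation from a mismatched behavioral Gaussian, and sampling from a behavioral distribution chosen to roughly track |f| times the target density, the classical minimum-variance recipe.

import numpy as np

rng = np.random.default_rng(0)

# Toy setting: one-dimensional Gaussian "policies" over an action a and a
# bounded reward f(a); the goal is J(pi) = E_{a ~ pi}[f(a)].
def f(a):
    return np.exp(-(a - 2.0) ** 2)

def gauss_pdf(a, mean, std):
    return np.exp(-0.5 * ((a - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

target_mean, target_std = 0.0, 1.0    # target policy pi
behav_mean, behav_std = 0.5, 1.5      # behavioral policy for off-policy evaluation
N = 100_000

# On-policy Monte Carlo baseline.
a_pi = rng.normal(target_mean, target_std, N)
J_mc = np.mean(f(a_pi))

# (1) Off-policy evaluation: estimate J(pi) from behavioral samples via IS.
a_b = rng.normal(behav_mean, behav_std, N)
w_b = gauss_pdf(a_b, target_mean, target_std) / gauss_pdf(a_b, behav_mean, behav_std)
J_is = np.mean(w_b * f(a_b))

# (2) Variance minimization: sample from a behavioral distribution chosen to
# roughly follow |f(a)| * pi(a), so samples concentrate where the integrand is
# large and the IS terms have lower variance than the on-policy ones.
q_mean, q_std = 1.3, 0.6
a_q = rng.normal(q_mean, q_std, N)
w_q = gauss_pdf(a_q, target_mean, target_std) / gauss_pdf(a_q, q_mean, q_std)
J_minvar = np.mean(w_q * f(a_q))

print(f"on-policy MC        : {J_mc:.4f}  (per-sample std {np.std(f(a_pi)):.4f})")
print(f"off-policy IS       : {J_is:.4f}  (per-sample std {np.std(w_b * f(a_b)):.4f})")
print(f"minimum-variance IS : {J_minvar:.4f}  (per-sample std {np.std(w_q * f(a_q)):.4f})")

With these (hypothetical) parameters, all three estimates agree, but the last estimator's per-sample standard deviation is far below the on-policy one: this is the sense in which a well-chosen behavioral distribution can beat sampling from the target itself, the second guise of IS that the paper relates to policy improvement.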

Cite this Paper


BibTeX
@InProceedings{pmlr-v216-metelli23a,
  title     = {On the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy Evaluation},
  author    = {Metelli, Alberto Maria and Meta, Samuele and Restelli, Marcello},
  booktitle = {Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence},
  pages     = {1423--1433},
  year      = {2023},
  editor    = {Evans, Robin J. and Shpitser, Ilya},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  month     = {31 Jul--04 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v216/metelli23a/metelli23a.pdf},
  url       = {https://proceedings.mlr.press/v216/metelli23a.html}
}
APA
Metelli, A.M., Meta, S. & Restelli, M. (2023). On the Relation between Policy Improvement and Off-Policy Minimum-Variance Policy Evaluation. Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 216:1423-1433. Available from https://proceedings.mlr.press/v216/metelli23a.html.
