Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design

Shuze Liu, Shangtong Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:32345-32368, 2024.

Abstract

Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators, whether for hyperparameter tuning or for testing different algorithmic design choices: the policy is repeatedly executed in the environment and the outcomes are averaged. Such massive interaction with the environment is prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first derive a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. We provide a theoretical analysis characterizing how the behavior policy learning error affects the amount of variance reduced. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements on the offline data.
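As a rough intuition for why executing a tailored behavior policy can make an unbiased Monte Carlo estimator more data efficient, the sketch below works through the one-step (bandit) special case with importance sampling: any behavior policy mu with sufficient support keeps the estimator E_{a~mu}[pi(a)/mu(a) * r(a)] unbiased for J(pi), and choosing mu(a) proportional to pi(a)|r(a)| (the classic single-step variance-minimizing proposal, not the paper's closed-form behavior policy for sequential MDPs) shrinks the variance. All rewards and policy probabilities are made-up toy values.

```python
# A minimal sketch (not the paper's algorithm) on a one-step bandit:
# an importance-sampling Monte Carlo estimator stays unbiased under any
# behavior policy with sufficient support, and a behavior policy
# proportional to pi(a) * |r(a)| can sharply cut the estimator's variance.
import numpy as np

rng = np.random.default_rng(0)

r = np.array([1.0, 0.2, 3.0])    # deterministic reward per action (toy values)
pi = np.array([0.5, 0.3, 0.2])   # target policy to evaluate
true_value = np.dot(pi, r)       # J(pi) = E_{a~pi}[r(a)]

# Classic variance-minimizing behavior policy for the one-step case:
# mu(a) proportional to pi(a) * |r(a)|.
mu = pi * np.abs(r)
mu /= mu.sum()

def mc_estimate(behavior, n):
    """Average of importance-weighted returns from n draws under `behavior`."""
    a = rng.choice(len(r), size=n, p=behavior)
    return np.mean((pi[a] / behavior[a]) * r[a])

# Both estimators are unbiased; the tailored behavior policy has lower variance.
on_policy = [mc_estimate(pi, 100) for _ in range(2000)]
off_policy = [mc_estimate(mu, 100) for _ in range(2000)]
print("true value:", true_value)
print("on-policy  mean/std:", np.mean(on_policy), np.std(on_policy))
print("off-policy mean/std:", np.mean(off_policy), np.std(off_policy))
```

In this toy case all rewards are positive, so the importance-weighted return under mu is constant and the off-policy standard deviation collapses to essentially zero; the paper's contribution is the sequential analogue of this choice, with the behavior policy learned from offline data.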

Cite this Paper

BibTeX
@InProceedings{pmlr-v235-liu24ca,
  title     = {Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design},
  author    = {Liu, Shuze and Zhang, Shangtong},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {32345--32368},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/liu24ca/liu24ca.pdf},
  url       = {https://proceedings.mlr.press/v235/liu24ca.html}
}
Endnote
%0 Conference Paper
%T Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
%A Shuze Liu
%A Shangtong Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-liu24ca
%I PMLR
%P 32345--32368
%U https://proceedings.mlr.press/v235/liu24ca.html
%V 235
APA
Liu, S. & Zhang, S. (2024). Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:32345-32368. Available from https://proceedings.mlr.press/v235/liu24ca.html.
