A Tale of Sampling and Estimation in Discounted Reinforcement Learning

Alberto Maria Metelli, Mirco Mutti, Marcello Restelli
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:4575-4601, 2023.

Abstract

The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is mostly unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to perform it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimation error with the mixing properties of the Markov process and the discount factor. Then, we provide a statistical analysis of a set of notable estimators and the corresponding sampling procedures, which includes the finite-horizon estimators often used in practice. Crucially, we show that estimating the mean by directly sampling from the discounted kernel of the Markov process enjoys compelling statistical properties compared to the alternative estimators, as it matches the lower bound without requiring careful tuning of the episode horizon.
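
To make the sampling idea concrete: the discounted stationary distribution satisfies nu_gamma(s) = (1 - gamma) * sum_t gamma^t P(s_t = s), so drawing a random horizon T ~ Geometric(1 - gamma) and evaluating the function at the final state yields an unbiased sample from nu_gamma, with no episode-horizon tuning. Below is a minimal Python sketch of this scheme, not the paper's implementation; `step`, `f`, and `s0` are hypothetical placeholders for the transition kernel, the target function, and the initial state.

import numpy as np

def discounted_kernel_estimate(step, f, s0, gamma, n_samples, seed=None):
    """Estimate E_{nu_gamma}[f(s)], where
    nu_gamma(s) = (1 - gamma) * sum_t gamma**t * P(s_t = s)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        s = s0
        # Sample a horizon T with P(T = t) = (1 - gamma) * gamma**t, t = 0, 1, ...
        # (numpy's geometric is supported on {1, 2, ...}, hence the -1).
        horizon = rng.geometric(1.0 - gamma) - 1
        for _ in range(horizon):
            s = step(s)  # one transition of the Markov process
        # f at the geometrically stopped state is an unbiased sample from nu_gamma:
        # E[f(s_T)] = sum_t (1 - gamma) * gamma**t * E[f(s_t)] = E_{nu_gamma}[f].
        total += f(s)
    return total / n_samples

A finite-horizon estimator would instead truncate every trajectory at a fixed horizon H, trading bias against variance through the choice of H; the geometric stopping above removes that tuning knob, which is the advantage the abstract highlights.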

Cite this Paper

BibTeX
@InProceedings{pmlr-v206-metelli23a,
  title     = {A Tale of Sampling and Estimation in Discounted Reinforcement Learning},
  author    = {Metelli, Alberto Maria and Mutti, Mirco and Restelli, Marcello},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {4575--4601},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/metelli23a/metelli23a.pdf},
  url       = {https://proceedings.mlr.press/v206/metelli23a.html}
}
APA
Metelli, A.M., Mutti, M. & Restelli, M. (2023). A Tale of Sampling and Estimation in Discounted Reinforcement Learning. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:4575-4601. Available from https://proceedings.mlr.press/v206/metelli23a.html.