Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation

Zaiwei Chen, Siva Theja Maguluri
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:11195-11214, 2022.

Abstract

In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithmic framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high-variance issue in off-policy learning. For policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.
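To give a concrete sense of the kind of update the abstract describes, below is a minimal illustrative sketch (Python with NumPy) of a multi-step off-policy TD update for a linearly parameterized Q-function, where the importance sampling ratios pi/mu are clipped from both above and below. This is not the paper's $\lambda$-averaged or two-sided $Q$-trace algorithm; the function names, the clipping thresholds lower and upper, the trajectory format, and the toy example at the end are all assumptions chosen for illustration only.

# Illustrative sketch only -- not the paper's lambda-averaged or two-sided Q-trace.
import numpy as np

def two_sided_clip(ratio, lower, upper):
    # Generalized importance sampling ratio: clip pi/mu into [lower, upper].
    return float(np.clip(ratio, lower, upper))

def multistep_offpolicy_td_update(w, phi, trajectory, pi, actions,
                                  gamma=0.9, alpha=0.1, lower=0.5, upper=1.5):
    # One multi-step TD update for a linear Q-function Q_w(s, a) = w @ phi(s, a).
    #   trajectory: [(s_t, a_t, r_t, mu_t), ..., (s_n, None, None, None)],
    #               where mu_t is the behavior-policy probability of a_t in s_t.
    #   pi(s, a):   target-policy probability; actions: finite action set.
    q = lambda s, a: float(w @ phi(s, a))
    v_pi = lambda s: sum(pi(s, b) * q(s, b) for b in actions)  # E_pi[Q_w(s, .)]

    s0, a0 = trajectory[0][0], trajectory[0][1]
    correction, g = 1.0, 0.0
    for t, (s, a, r, mu_prob) in enumerate(trajectory[:-1]):
        if t > 0:
            # Accumulate clipped importance sampling ratios along the trajectory.
            correction *= two_sided_clip(pi(s, a) / mu_prob, lower, upper)
        s_next = trajectory[t + 1][0]
        td_error = r + gamma * v_pi(s_next) - q(s, a)  # bootstrap under pi
        g += (gamma ** t) * correction * td_error

    # Single stochastic-approximation step on the linear weights.
    return w + alpha * g * phi(s0, a0)

# Toy usage: 2 states, 2 actions, one-hot features over (state, action) pairs.
phi = lambda s, a: np.eye(4)[2 * s + a]
pi = lambda s, a: 0.7 if a == 0 else 0.3
behavior_traj = [(0, 1, 1.0, 0.5), (1, 0, 0.0, 0.5), (0, None, None, None)]
w = multistep_offpolicy_td_update(np.zeros(4), phi, behavior_traj, pi, actions=[0, 1])

Clipping the ratios from above is the standard way to control the variance of off-policy corrections, at the cost of some truncation bias; a two-sided clip additionally bounds the ratios away from zero. The sketch is only meant to convey that flavor of update, under the stated assumptions.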

Cite this Paper


BibTeX
@InProceedings{pmlr-v151-chen22i,
  title     = {Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation},
  author    = {Chen, Zaiwei and Theja Maguluri, Siva},
  booktitle = {Proceedings of The 25th International Conference on Artificial Intelligence and Statistics},
  pages     = {11195--11214},
  year      = {2022},
  editor    = {Camps-Valls, Gustau and Ruiz, Francisco J. R. and Valera, Isabel},
  volume    = {151},
  series    = {Proceedings of Machine Learning Research},
  month     = {28--30 Mar},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v151/chen22i/chen22i.pdf},
  url       = {https://proceedings.mlr.press/v151/chen22i.html},
  abstract  = {In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithm framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high variance issue in off-policy learning. As for the policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\Tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.}
}
APA
Chen, Z. & Theja Maguluri, S. (2022). Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation. Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 151:11195-11214. Available from https://proceedings.mlr.press/v151/chen22i.html.
