Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation

Andrea Zanette, Ching-An Cheng, Alekh Agarwal
Proceedings of Thirty Fourth Conference on Learning Theory, PMLR 134:4473-4525, 2021.

Abstract

Policy optimization methods are popular reinforcement learning (RL) algorithms, because their incremental and on-policy nature makes them more stable than their value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy nature is not amenable to data reuse and the incremental updates result in a large iteration complexity before a near-optimal policy is discovered. These empirical findings are mirrored in theory in the recent work of \citet{agarwal2020pc}, which provides a policy optimization method PC-PG that provably finds near-optimal policies in the linear MDP model and is robust to model misspecification, but suffers from an extremely poor sample complexity compared with value-based techniques. In this paper, we propose a new algorithm, COPOE, that overcomes the poor sample complexity of PC-PG while retaining the robustness to model misspecification. COPOE makes several important algorithmic enhancements to PC-PG, such as enabling data reuse, and uses more refined analysis techniques, which we expect to be more broadly applicable to designing new RL algorithms. The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PC-PG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.
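For context, the linear MDP model referred to above is the standard one from the linear function approximation literature (this definition is background, not part of the paper's abstract): transitions and rewards are assumed to be linear in a known $d$-dimensional feature map $\phi(s,a)$,

$$P(s' \mid s, a) = \langle \phi(s,a), \mu(s') \rangle, \qquad r(s,a) = \langle \phi(s,a), \theta \rangle,$$

for unknown measures $\mu(\cdot)$ and an unknown parameter vector $\theta$. Model misspecification refers to these identities holding only approximately, up to some error.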

Cite this Paper


BibTeX
@InProceedings{pmlr-v134-zanette21a,
  title     = {Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation},
  author    = {Zanette, Andrea and Cheng, Ching-An and Agarwal, Alekh},
  booktitle = {Proceedings of Thirty Fourth Conference on Learning Theory},
  pages     = {4473--4525},
  year      = {2021},
  editor    = {Belkin, Mikhail and Kpotufe, Samory},
  volume    = {134},
  series    = {Proceedings of Machine Learning Research},
  month     = {15--19 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v134/zanette21a/zanette21a.pdf},
  url       = {https://proceedings.mlr.press/v134/zanette21a.html}
}
APA
Zanette, A., Cheng, C.-A., & Agarwal, A. (2021). Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation. Proceedings of Thirty Fourth Conference on Learning Theory, in Proceedings of Machine Learning Research 134:4473-4525. Available from https://proceedings.mlr.press/v134/zanette21a.html.