Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees

Hsin-En Su, Yen-Ju Chen, Ping-Chun Hsieh, Xi Liu
Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, PMLR 206:1331-1378, 2023.

Abstract

We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly used approach is to leverage the off-policy policy gradient to optimize a surrogate objective: the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from a distribution mismatch issue, and significant effort is therefore needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient methods. We establish the global convergence of CAPO with general coordinate selection and then quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
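
To make the coordinate-ascent view concrete, the following is a minimal, hypothetical sketch of a coordinate-wise policy update on a tabular softmax policy, with cyclic and randomized coordinate selection and no importance sampling. The sign-of-advantage step, the fixed step size, and the helper names (softmax_policy, coordinate_update, cyclic_coords, random_coord) are illustrative assumptions for this sketch, not the paper's exact CAPO update rule or step-size conditions.

# Illustrative coordinate-ascent-style policy improvement for a tabular
# softmax policy. Assumes (hypothetically) a sign-of-advantage step on a
# single logit coordinate, selected cyclically or at random; this is a
# sketch, not the paper's exact CAPO update.
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax over action logits; theta has shape (n_states, n_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def coordinate_update(theta, advantage, coord, step_size=0.5):
    """Move a single (state, action) logit in the direction of its advantage sign."""
    s, a = coord
    theta = theta.copy()
    theta[s, a] += step_size * np.sign(advantage[s, a])
    return theta

def cyclic_coords(n_states, n_actions):
    """Cyclic coordinate selection: sweep all (s, a) pairs in a fixed order."""
    while True:
        for s in range(n_states):
            for a in range(n_actions):
                yield (s, a)

def random_coord(n_states, n_actions, rng):
    """Randomized coordinate selection: sample one (s, a) pair uniformly."""
    return (rng.integers(n_states), rng.integers(n_actions))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions = 4, 3
    theta = np.zeros((n_states, n_actions))
    # Stand-in advantage estimates; in practice these would come from an
    # off-policy critic trained on behavior-policy data.
    advantage = rng.normal(size=(n_states, n_actions))

    coords = cyclic_coords(n_states, n_actions)
    for _ in range(n_states * n_actions):      # one full cyclic sweep
        theta = coordinate_update(theta, advantage, next(coords))
    for _ in range(5):                         # a few randomized-coordinate updates
        theta = coordinate_update(theta, advantage, random_coord(n_states, n_actions, rng))
    print(softmax_policy(theta).round(3))

Note that each update touches only one logit and depends only on the advantage estimate at that (state, action) pair, which is why no state-distribution correction enters the improvement step in this sketch.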

Cite this Paper


BibTeX
@InProceedings{pmlr-v206-su23a,
  title     = {Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees},
  author    = {Su, Hsin-En and Chen, Yen-Ju and Hsieh, Ping-Chun and Liu, Xi},
  booktitle = {Proceedings of The 26th International Conference on Artificial Intelligence and Statistics},
  pages     = {1331--1378},
  year      = {2023},
  editor    = {Ruiz, Francisco and Dy, Jennifer and van de Meent, Jan-Willem},
  volume    = {206},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--27 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v206/su23a/su23a.pdf},
  url       = {https://proceedings.mlr.press/v206/su23a.html},
  abstract  = {We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly used approach is to leverage the off-policy policy gradient to optimize a surrogate objective: the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from a distribution mismatch issue, and significant effort is therefore needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient methods. We establish the global convergence of CAPO with general coordinate selection and then quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.}
}
Endnote
%0 Conference Paper
%T Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
%A Hsin-En Su
%A Yen-Ju Chen
%A Ping-Chun Hsieh
%A Xi Liu
%B Proceedings of The 26th International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2023
%E Francisco Ruiz
%E Jennifer Dy
%E Jan-Willem van de Meent
%F pmlr-v206-su23a
%I PMLR
%P 1331--1378
%U https://proceedings.mlr.press/v206/su23a.html
%V 206
%X We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly used approach is to leverage the off-policy policy gradient to optimize a surrogate objective: the expected total discounted return of the target policy under the state distribution of the behavior policy. However, this approach has been shown to suffer from a distribution mismatch issue, and significant effort is therefore needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance sampling in the policy improvement step of off-policy policy gradient methods. We establish the global convergence of CAPO with general coordinate selection and then quantify the convergence rates of several instances of CAPO with popular coordinate selection rules, including the cyclic and randomized variants of CAPO. We then extend CAPO to neural policies for a more practical implementation. Through experiments, we demonstrate that CAPO provides a competitive approach to RL in practice.
APA
Su, H., Chen, Y., Hsieh, P. & Liu, X. (2023). Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 206:1331-1378. Available from https://proceedings.mlr.press/v206/su23a.html.