On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction
Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, PMLR 151:2658-2705, 2022.
Abstract
In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density-ratio correction under the function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We first characterize the bias of the learning objective and then present two strategies with finite-time convergence guarantees. In our first strategy, we propose an algorithm called P-SREDA with a convergence rate of $O(\epsilon^{-3})$, whose dependence on $\epsilon$ is optimal. In our second strategy, we design a new off-policy actor-critic-style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
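For context on the max-max-min structure, the following is a minimal LaTeX sketch under standard DICE-style density-ratio notation; the symbols $\tau$, $f$, $\lambda$, and $L_{\mathrm{dual}}$ are illustrative assumptions, not the paper's exact quantities:

% Sketch: off-policy objective with state-action density-ratio correction.
% Outer max: policy pi.  Middle max: ratio candidate tau approximating
% d^pi(s,a) / d^D(s,a), the ratio of the target policy's occupancy to the
% data distribution.  Inner min: dual function f enforcing consistency of tau.
\[
\max_{\pi}\ \max_{\tau}\ \min_{f}\
\mathbb{E}_{(s,a)\sim d^{D}}\!\big[\tau(s,a)\,r(s,a)\big]
\;-\;\lambda\, L_{\mathrm{dual}}(\tau, f; \pi)
\]

Under this kind of formulation, the inner max-min estimates the policy value $J(\pi) = \mathbb{E}_{(s,a)\sim d^{\pi}}[r(s,a)]$ from data drawn under $d^{D}$ alone, and the outer maximization over $\pi$ is the optimization whose convergence rates the two algorithms address.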