Provably Efficient Exploration in Policy Optimization

Qi Cai; Zhuoran Yang; Chi Jin; Zhaoran Wang

Proceedings of the 37th International Conference on Machine Learning, PMLR 119:1283-1294, 2020.

Abstract

While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an “optimistic version” of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
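
As a reading aid, the regret bound above can be unpacked with simple algebra. This is only a sketch: the symbol $K$ (number of episodes) is notation assumed here rather than taken from the abstract, and the second identity presumes that every episode lasts exactly $H$ steps, so that $T = HK$.

```latex
\[
\sqrt{d^2 H^3 T} \;=\; d\,H^{3/2}\sqrt{T},
\qquad\text{and, if } T = HK,\qquad
\sqrt{d^2 H^3 \cdot HK} \;=\; d\,H^{2}\sqrt{K}.
\]
```

In either form the bound grows as the square root of the amount of interaction, so the average regret per step, of order $d H^{3/2}/\sqrt{T}$ up to the logarithmic factors hidden by $\tilde{O}$, vanishes as $T$ grows.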

Cite this Paper

BibTeX

@InProceedings{pmlr-v119-cai20d,
  title = 	 {Provably Efficient Exploration in Policy Optimization},
  author =       {Cai, Qi and Yang, Zhuoran and Jin, Chi and Wang, Zhaoran},
  booktitle = 	 {Proceedings of the 37th International Conference on Machine Learning},
  pages = 	 {1283--1294},
  year = 	 {2020},
  editor = 	 {Daumé III, Hal and Singh, Aarti},
  volume = 	 {119},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {13--18 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v119/cai20d/cai20d.pdf},
  url = 	 {https://proceedings.mlr.press/v119/cai20d.html},
  abstract = 	 {While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an “optimistic version” of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.}
}

Endnote

%0 Conference Paper
%T Provably Efficient Exploration in Policy Optimization
%A Qi Cai
%A Zhuoran Yang
%A Chi Jin
%A Zhaoran Wang
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-cai20d
%I PMLR
%P 1283--1294
%U https://proceedings.mlr.press/v119/cai20d.html
%V 119
%X While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an “optimistic version” of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

APA

Cai, Q., Yang, Z., Jin, C., & Wang, Z. (2020). Provably Efficient Exploration in Policy Optimization. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:1283-1294. Available from https://proceedings.mlr.press/v119/cai20d.html.
