Offline Reinforcement Learning with Closed-Form Policy Improvement Operators

Jiachen Li; Edwin Zhang; Ming Yin; Qinxun Bai; Yu-Xiang Wang; William Yang Wang

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:20485-20528, 2023.

Abstract

Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp’s lower bound and Jensen’s Inequality, giving rise to a closed-form policy improvement operator. We instantiate both one-step and iterative offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. Our code is available at https://cfpi-icml23.github.io/.
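The closed-form idea in the abstract can be illustrated with a simplified case. The sketch below is not the authors' implementation and does not cover the Gaussian Mixture setting. It assumes a single Gaussian behavior policy with diagonal covariance, and it assumes the behavior constraint takes the form of a bound `delta` on half the squared Mahalanobis distance from the behavior mean. Under those assumptions, maximizing the first-order (linear) approximation of the value function has a closed-form solution. The function name `closed_form_improvement` and all parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def closed_form_improvement(
    mean: Sequence[float],
    var: Sequence[float],
    q_grad: Sequence[float],
    delta: float,
) -> List[float]:
    """Maximize the linearized objective q_grad . a subject to
    0.5 * sum((a - mean)^2 / var) <= delta.

    mean, var: mean and diagonal variances of the (assumed) Gaussian
               behavior policy at the current state.
    q_grad:    gradient of the learned Q-function with respect to the
               action, evaluated at the behavior mean.
    delta:     assumed trust-region radius (hypothetical hyperparameter).
    """
    # g^T Sigma g for a diagonal covariance Sigma.
    scaled_sq_norm = sum(v * g * g for v, g in zip(var, q_grad))
    if scaled_sq_norm == 0.0:
        # A zero gradient gives no improvement direction; stay at the mean.
        return list(mean)
    # Step size from the Lagrangian: the constraint is tight at the optimum.
    step = math.sqrt(2.0 * delta / scaled_sq_norm)
    # Move along Sigma * g, the steepest direction under the constraint.
    return [m + step * v * g for m, v, g in zip(mean, var, q_grad)]


# Usage: the action shifts further along the high-variance dimension.
action = closed_form_improvement([0.0, 0.0], [1.0, 4.0], [0.0, 1.0], 0.5)
print(action)  # [0.0, 2.0]
```

Because the objective is linear and the constraint set is an ellipsoid, no gradient steps are needed: the maximizer lies on the boundary in the direction of the covariance-scaled gradient.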

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-li23av,
  title = {Offline Reinforcement Learning with Closed-Form Policy Improvement Operators},
  author = {Li, Jiachen and Zhang, Edwin and Yin, Ming and Bai, Qinxun and Wang, Yu-Xiang and Wang, William Yang},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages = {20485--20528},
  year = {2023},
  editor = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = {202},
  series = {Proceedings of Machine Learning Research},
  month = {23--29 Jul},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v202/li23av/li23av.pdf},
  url = {https://proceedings.mlr.press/v202/li23av.html},
  abstract = {Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp’s lower bound and Jensen’s Inequality, giving rise to a closed-form policy improvement operator. We instantiate both one-step and iterative offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. Our code is available at https://cfpi-icml23.github.io/.}
}

Endnote

%0 Conference Paper
%T Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
%A Jiachen Li
%A Edwin Zhang
%A Ming Yin
%A Qinxun Bai
%A Yu-Xiang Wang
%A William Yang Wang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-li23av
%I PMLR
%P 20485--20528
%U https://proceedings.mlr.press/v202/li23av.html
%V 202
%X Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp’s lower bound and Jensen’s Inequality, giving rise to a closed-form policy improvement operator. We instantiate both one-step and iterative offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. Our code is available at https://cfpi-icml23.github.io/.

APA


Li, J., Zhang, E., Yin, M., Bai, Q., Wang, Y., & Wang, W. Y. (2023). Offline Reinforcement Learning with Closed-Form Policy Improvement Operators. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:20485-20528. Available from https://proceedings.mlr.press/v202/li23av.html.
