Implicit and Explicit Policy Constraints for Offline Reinforcement Learning

Yang Liu, Marius Hofert
Proceedings of the Third Conference on Causal Learning and Reasoning, PMLR 236:499-513, 2024.

Abstract

Offline reinforcement learning (RL) aims to improve the target policy over the behavior policy based on historical data. A major problem in offline RL is the distribution shift caused by out-of-distribution actions, which leads to overestimation of the Q-value. Most existing works rely on either behavioral cloning (BC) or Q-Learning-maximization methods to suppress the distribution shift. BC methods mitigate the shift by constraining the target policy to stay close to the offline data, but this makes the learned policy highly conservative. Q-Learning-maximization methods, on the other hand, adopt a pessimism mechanism: they generate actions by maximizing the Q-value and penalize the Q-value according to the uncertainty of the actions. However, the generated actions might be arbitrary, which makes the predicted Q-values highly uncertain and in turn misguides the policy when generating the next action. To alleviate the adverse effect of the distribution shift, we propose to constrain the policy both implicitly and explicitly by unifying Q-Learning and behavior cloning, thereby tackling the exploration-exploitation dilemma. For the implicit constraint, we unify the action spaces with generative adversarial networks that are trained to make the actions of the target policy and the behavior policy indistinguishable. For the explicit constraint, we use multiple importance sampling (MIS) to learn an advantage weight for each state-action pair, which is then used to either suppress or make full use of that pair. Extensive experiments on the D4RL benchmark show that our approaches achieve superior performance. The results on the Maze2D data indicate that MIS handles heterogeneous data better than single importance sampling, and we also found that MIS effectively stabilizes the reward curve.
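As a rough illustration only (not taken from the paper): the implicit constraint described above can be sketched as a GAN in which a discriminator tries to tell dataset actions apart from target-policy actions for the same states, while the policy is trained to fool it. Everything below — the network sizes, the state/action dimensions, the plain binary cross-entropy losses, and the omission of the Q-maximization term — is an assumption made for the sketch, not the authors' implementation.

# Hypothetical sketch of the implicit (GAN-based) constraint: a discriminator is
# trained to tell dataset actions apart from policy actions for the same states,
# while the policy is trained to fool it (the full method would also maximize Q).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6  # illustrative dimensions, not from the paper

policy = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(),
                       nn.Linear(256, ACTION_DIM), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
                              nn.Linear(256, 1))

opt_pi = torch.optim.Adam(policy.parameters(), lr=3e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

def gan_constraint_step(states, dataset_actions):
    """One adversarial update on a batch drawn from the offline dataset."""
    # 1) Discriminator: real = (s, a_data), fake = (s, pi(s)).
    with torch.no_grad():
        fake_actions = policy(states)
    real_logits = discriminator(torch.cat([states, dataset_actions], dim=-1))
    fake_logits = discriminator(torch.cat([states, fake_actions], dim=-1))
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Policy: make its actions indistinguishable from dataset actions.
    #    (The unified objective would add a Q-value maximization term here.)
    pi_logits = discriminator(torch.cat([states, policy(states)], dim=-1))
    pi_loss = bce(pi_logits, torch.ones_like(pi_logits))
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    return d_loss.item(), pi_loss.item()

# Usage with a random batch standing in for offline data:
s = torch.randn(128, STATE_DIM)
a = torch.rand(128, ACTION_DIM) * 2 - 1
print(gan_constraint_step(s, a))

In the full method described in the abstract, the policy update would additionally include a Q-value maximization term, so that the policy improves over the behavior policy rather than merely imitating it.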
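Similarly, the explicit constraint can be illustrated with a balance-heuristic multiple importance sampling weight over several candidate behavior policies, combined with an advantage term that up- or down-weights each state-action pair in a weighted behavior-cloning loss. The abstract does not specify the exact weighting scheme or how the advantage is estimated; the balance heuristic, the exponential advantage weight, and the temperature below are assumptions for illustration.

# Hypothetical sketch of the explicit (MIS-based) constraint: a balance-heuristic
# weight over K candidate behavior policies, combined with an advantage term, is
# used to suppress or fully exploit each (s, a) pair in a weighted BC loss.
import torch

def mis_weight(log_pi, behavior_log_probs, mixture_probs):
    """Balance-heuristic weight  pi(a|s) / sum_k c_k q_k(a|s).

    log_pi:             (B,)   log-prob of (s, a) under the target policy
    behavior_log_probs: (B, K) log-probs under K candidate behavior policies
    mixture_probs:      (K,)   estimated mixture proportions, summing to 1
    """
    log_mix = torch.logsumexp(behavior_log_probs + mixture_probs.log(), dim=-1)
    return torch.exp(log_pi - log_mix)

def weighted_bc_loss(log_pi, behavior_log_probs, mixture_probs, advantage, temp=1.0):
    # Pairs with a high advantage and a well-supported MIS weight are used fully;
    # the remaining pairs are suppressed.
    w = mis_weight(log_pi, behavior_log_probs, mixture_probs).detach()
    w = w * torch.exp(advantage / temp).clamp(max=100.0)
    return -(w * log_pi).mean()

# Toy usage: 4 samples, 3 candidate behavior policies.
log_pi = torch.randn(4, requires_grad=True)
behavior_log_probs = torch.randn(4, 3)
mixture_probs = torch.tensor([0.5, 0.3, 0.2])
advantage = torch.randn(4)
loss = weighted_bc_loss(log_pi, behavior_log_probs, mixture_probs, advantage)
loss.backward()
print(loss.item(), log_pi.grad)

In practice the mixture proportions would be estimated from the heterogeneous dataset (e.g. the Maze2D data mentioned above), which is the setting in which the abstract reports MIS outperforming single importance sampling.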

Cite this Paper


BibTeX
@InProceedings{pmlr-v236-liu24a,
  title     = {Implicit and Explicit Policy Constraints for Offline Reinforcement Learning},
  author    = {Liu, Yang and Hofert, Marius},
  booktitle = {Proceedings of the Third Conference on Causal Learning and Reasoning},
  pages     = {499--513},
  year      = {2024},
  editor    = {Locatello, Francesco and Didelez, Vanessa},
  volume    = {236},
  series    = {Proceedings of Machine Learning Research},
  month     = {01--03 Apr},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v236/liu24a/liu24a.pdf},
  url       = {https://proceedings.mlr.press/v236/liu24a.html}
}
Endnote
%0 Conference Paper
%T Implicit and Explicit Policy Constraints for Offline Reinforcement Learning
%A Yang Liu
%A Marius Hofert
%B Proceedings of the Third Conference on Causal Learning and Reasoning
%C Proceedings of Machine Learning Research
%D 2024
%E Francesco Locatello
%E Vanessa Didelez
%F pmlr-v236-liu24a
%I PMLR
%P 499--513
%U https://proceedings.mlr.press/v236/liu24a.html
%V 236
APA
Liu, Y. & Hofert, M. (2024). Implicit and Explicit Policy Constraints for Offline Reinforcement Learning. Proceedings of the Third Conference on Causal Learning and Reasoning, in Proceedings of Machine Learning Research 236:499-513. Available from https://proceedings.mlr.press/v236/liu24a.html.