OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation

Jongmin Lee, Wonseok Jeon, Byungjun Lee, Joelle Pineau, Kee-Eung Kim
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:6120-6130, 2021.

Abstract

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
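
For readers less familiar with DICE-style methods, the following is an informal sketch of the idea behind stationary distribution correction estimation. The notation is generic, and the regularized objective below is only indicative of the family of formulations the paper builds on, not a verbatim restatement of its exact objective. Writing d^D for the state-action distribution induced by the offline dataset, a DICE-style method optimizes over stationary distributions d rather than directly over policies:

    \max_{d \ge 0} \; \mathbb{E}_{(s,a) \sim d}\big[R(s,a)\big] \;-\; \alpha\, D_f\!\big(d \,\|\, d^D\big)
    \quad \text{subject to} \quad \sum_{a} d(s,a) \;=\; (1-\gamma)\, p_0(s) \;+\; \gamma \sum_{\bar s, \bar a} T(s \mid \bar s, \bar a)\, d(\bar s, \bar a) \quad \forall s,

where the f-divergence term keeps the optimized distribution close to the data and the constraint enforces Bellman flow consistency. The quantity estimated in practice is the stationary distribution correction

    w^*(s,a) \;=\; \frac{d^*(s,a)}{d^D(s,a)},

from which a policy can be extracted, for example by correction-weighted behavior cloning:

    \max_{\pi} \; \mathbb{E}_{(s,a) \sim d^D}\big[\, w^*(s,a)\, \log \pi(a \mid s) \,\big].

Because this optimization never bootstraps action values at out-of-distribution actions, the overestimation issue described in the abstract does not arise in the same form.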

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-lee21f,
  title     = {OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation},
  author    = {Lee, Jongmin and Jeon, Wonseok and Lee, Byungjun and Pineau, Joelle and Kim, Kee-Eung},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {6120--6130},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/lee21f/lee21f.pdf},
  url       = {https://proceedings.mlr.press/v139/lee21f.html},
  abstract  = {We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.}
}
Endnote
%0 Conference Paper
%T OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation
%A Jongmin Lee
%A Wonseok Jeon
%A Byungjun Lee
%A Joelle Pineau
%A Kee-Eung Kim
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-lee21f
%I PMLR
%P 6120--6130
%U https://proceedings.mlr.press/v139/lee21f.html
%V 139
%X We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
APA
Lee, J., Jeon, W., Lee, B., Pineau, J. & Kim, K. (2021). OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:6120-6130. Available from https://proceedings.mlr.press/v139/lee21f.html.
