Batch Reinforcement Learning with Hyperparameter Gradients

Byungjun Lee, Jongmin Lee, Peter Vrancx, Dongho Kim, Kee-Eung Kim
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:5725-5735, 2020.

Abstract

We consider the batch reinforcement learning problem, where the agent must learn from a fixed batch of data without further interaction with the environment. In this setting, we want to prevent the optimized policy from deviating too much from the data-collection policy, since otherwise the estimation becomes highly unstable due to the off-policy nature of the problem. However, imposing this requirement too strongly results in a policy that merely mimics the data-collection policy. Unlike prior work, where this trade-off is controlled by hand-tuned hyperparameters, we propose a novel batch reinforcement learning approach, batch optimization of policy and hyperparameter (BOPAH), which performs gradient-based optimization of the hyperparameter on held-out data. We show that BOPAH outperforms other batch reinforcement learning algorithms on tabular and continuous control tasks by striking a good balance between adhering to the data-collection policy and pursuing possible policy improvement.
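To make the recipe concrete, below is a minimal sketch (not the authors' implementation) of the BOPAH idea in a toy one-state bandit setting: the fixed batch is split into a training half, used to fit a KL-regularized policy for a given hyperparameter alpha, and a held-out half, used to update alpha itself. The closed-form inner solution, the finite-difference hyperparameter gradient, and all names (K, pi_b, q_hat, ...) are illustrative assumptions of this sketch; the paper derives the hyperparameter gradient analytically and handles full sequential tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                        # actions in a toy one-state bandit
true_means = np.array([0.0, 0.3, 0.6, 1.0])  # unknown to the learner
pi_b = np.array([0.4, 0.3, 0.2, 0.1])        # data-collection (behavior) policy

def sample_batch(n):
    a = rng.choice(K, size=n, p=pi_b)
    r = rng.normal(true_means[a], 1.0)
    return a, r

# Fixed batch, split into training and held-out validation sets.
a_tr, r_tr = sample_batch(2000)
a_val, r_val = sample_batch(2000)
q_hat = np.array([r_tr[a_tr == k].mean() for k in range(K)])  # empirical Q on train

def solve_policy(alpha):
    # Inner problem: max_pi  pi . q_hat - alpha * KL(pi || pi_b).
    # In this tabular case it has the closed form pi(a) ∝ pi_b(a) exp(q_hat(a)/alpha).
    logits = np.log(pi_b) + q_hat / alpha
    logits -= logits.max()                    # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

def heldout_value(alpha):
    # Importance-sampling estimate of the learned policy's value on held-out data.
    pi = solve_policy(alpha)
    return np.mean(pi[a_val] / pi_b[a_val] * r_val)

# Outer problem: ascend the held-out value with respect to the hyperparameter.
# The paper uses an analytic hyperparameter gradient; this sketch approximates
# dV/d(log alpha) with a central finite difference, keeping alpha positive.
log_alpha, lr, eps = 0.0, 0.25, 1e-3
for step in range(200):
    g = (heldout_value(np.exp(log_alpha + eps)) -
         heldout_value(np.exp(log_alpha - eps))) / (2 * eps)
    log_alpha += lr * g

print("alpha* =", np.exp(log_alpha), "policy =", solve_policy(np.exp(log_alpha)))
```

A small alpha lets the learned policy chase the empirical Q-values; a large alpha pins it to pi_b. The held-out importance-sampling estimate is what arbitrates between the two, which is the trade-off the abstract describes.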

Cite this Paper

BibTeX
@InProceedings{pmlr-v119-lee20d,
  title     = {Batch Reinforcement Learning with Hyperparameter Gradients},
  author    = {Lee, Byungjun and Lee, Jongmin and Vrancx, Peter and Kim, Dongho and Kim, Kee-Eung},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {5725--5735},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/lee20d/lee20d.pdf},
  url       = {https://proceedings.mlr.press/v119/lee20d.html}
}
Endnote
%0 Conference Paper
%T Batch Reinforcement Learning with Hyperparameter Gradients
%A Byungjun Lee
%A Jongmin Lee
%A Peter Vrancx
%A Dongho Kim
%A Kee-Eung Kim
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-lee20d
%I PMLR
%P 5725--5735
%U https://proceedings.mlr.press/v119/lee20d.html
%V 119
APA
Lee, B., Lee, J., Vrancx, P., Kim, D. & Kim, K. (2020). Batch Reinforcement Learning with Hyperparameter Gradients. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:5725-5735. Available from https://proceedings.mlr.press/v119/lee20d.html.
