Boosting Offline Reinforcement Learning with Action Preference Query

Qisen Yang, Shenzhi Wang, Matthieu Gaetan Lin, Shiji Song, Gao Huang
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:39509-39523, 2023.

Abstract

Training practical agents usually involves both offline and online reinforcement learning (RL) to balance the policy’s performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic in high-stakes scenarios such as healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful in correcting erroneous estimates. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attain a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy’s performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark with state-of-the-art algorithms demonstrate that OAP yields higher scores (29% higher on average), especially on the challenging AntMaze tasks (98% higher).
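
The abstract’s central mechanism, adaptively keeping or suppressing the policy constraint per state according to whether the pre-collected or the learned action is preferred, can be pictured with a short sketch. The snippet below is only an illustrative PyTorch interpretation of that idea, not the paper’s formulation; the names (oap_style_actor_loss, preference_fn, bc_weight) and the squared-error behavior-cloning constraint are assumptions made for this sketch.

import torch

def oap_style_actor_loss(actor, critic, states, dataset_actions,
                         preference_fn, bc_weight=2.5):
    """Illustrative actor loss whose policy constraint is kept or suppressed
    per sample, depending on whether the pre-collected (dataset) action or
    the learned (policy) action is preferred."""
    policy_actions = actor(states)

    # Query preferences: 1.0 where the dataset action is preferred over the
    # policy's action, 0.0 where the learned action is preferred.
    with torch.no_grad():
        prefer_dataset = preference_fn(states, dataset_actions, policy_actions)

    # Standard value-maximization term.
    rl_loss = -critic(states, policy_actions).mean()

    # Behavior-cloning constraint enforced only where the dataset action is
    # preferred; where the learned action is preferred, the constraint is
    # suppressed so that value gains endorsed by the preference count as
    # policy improvement rather than overestimation.
    bc_error = ((policy_actions - dataset_actions) ** 2).sum(dim=-1)
    constraint_loss = (prefer_dataset * bc_error).mean()

    return rl_loss + bc_weight * constraint_loss

Under this reading, the preference query plays the role that corrective online interaction would otherwise play: it tells the learner, state by state, where trusting its own action is safe and where it should stay close to the dataset.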

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-yang23o,
  title     = {Boosting Offline Reinforcement Learning with Action Preference Query},
  author    = {Yang, Qisen and Wang, Shenzhi and Lin, Matthieu Gaetan and Song, Shiji and Huang, Gao},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {39509--39523},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/yang23o/yang23o.pdf},
  url       = {https://proceedings.mlr.press/v202/yang23o.html},
  abstract  = {Training practical agents usually involves both offline and online reinforcement learning (RL) to balance the policy’s performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic in high-stakes scenarios such as healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful in correcting erroneous estimates. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attain a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy’s performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark with state-of-the-art algorithms demonstrate that OAP yields higher scores (29% higher on average), especially on the challenging AntMaze tasks (98% higher).}
}
Endnote
%0 Conference Paper
%T Boosting Offline Reinforcement Learning with Action Preference Query
%A Qisen Yang
%A Shenzhi Wang
%A Matthieu Gaetan Lin
%A Shiji Song
%A Gao Huang
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-yang23o
%I PMLR
%P 39509--39523
%U https://proceedings.mlr.press/v202/yang23o.html
%V 202
%X Training practical agents usually involves both offline and online reinforcement learning (RL) to balance the policy’s performance and interaction costs. In particular, online fine-tuning has become a commonly used method to correct the erroneous estimates of out-of-distribution data learned in the offline training phase. However, even limited online interactions can be inaccessible or catastrophic in high-stakes scenarios such as healthcare and autonomous driving. In this work, we introduce an interaction-free training scheme dubbed Offline-with-Action-Preferences (OAP). The main insight is that, compared to online fine-tuning, querying the preferences between pre-collected and learned actions can be equally or even more helpful in correcting erroneous estimates. By adaptively encouraging or suppressing the policy constraint according to action preferences, OAP can distinguish overestimation from beneficial policy improvement and thus attain a more accurate evaluation of unseen data. Theoretically, we prove a lower bound on the behavior policy’s performance improvement brought by OAP. Moreover, comprehensive experiments on the D4RL benchmark with state-of-the-art algorithms demonstrate that OAP yields higher scores (29% higher on average), especially on the challenging AntMaze tasks (98% higher).
APA
Yang, Q., Wang, S., Lin, M.G., Song, S. & Huang, G. (2023). Boosting Offline Reinforcement Learning with Action Preference Query. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:39509-39523. Available from https://proceedings.mlr.press/v202/yang23o.html.

Related Material