Bayesian Design Principles for Offline-to-Online Reinforcement Learning

Hao Hu, Yiqin Yang, Jianing Ye, Chengjie Wu, Ziqing Mai, Yujing Hu, Tangjie Lv, Changjie Fan, Qianchuan Zhao, Chongjie Zhang
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19491-19515, 2024.

Abstract

Offline reinforcement learning (RL) is crucial for real-world applications where exploration can be costly or unsafe. However, offline learned policies are often suboptimal, and further online fine-tuning is required. In this paper, we tackle the fundamental dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop. We show that Bayesian design principles are crucial in solving such a dilemma. Instead of adopting optimistic or pessimistic policies, the agent should act in a way that matches its belief in optimal policies. Such a probability-matching agent can avoid a sudden performance drop while still being guaranteed to find the optimal policy. Based on our theoretical findings, we introduce a novel algorithm that outperforms existing methods on various benchmarks, demonstrating the efficacy of our approach. Overall, the proposed approach provides a new perspective on offline-to-online RL that has the potential to enable more effective learning from offline data.
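
To make the probability-matching idea concrete, here is a minimal, purely illustrative sketch (not the authors' algorithm) of a Bayesian agent in a two-armed Bernoulli bandit that is warm-started with "offline" data. Instead of acting greedily on pessimistic or optimistic value estimates, it samples actions in proportion to its posterior belief that each action is optimal (Thompson-style probability matching). All counts, priors, and variable names below are assumptions chosen for the toy example.

```python
# Illustrative sketch only: probability matching (Thompson sampling) in a toy
# two-armed bandit, warm-started from an "offline" dataset. This is not the
# paper's algorithm; it only illustrates acting according to one's belief in
# which action is optimal.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.6])      # unknown to the agent

# Hypothetical "offline" dataset: the behavior policy mostly pulled arm 0.
offline_pulls = np.array([80, 5])
offline_wins = np.array([32, 3])

# Beta posterior per arm, initialized from the offline counts (uniform prior).
alpha = 1.0 + offline_wins
beta = 1.0 + (offline_pulls - offline_wins)

rewards = []
for t in range(2000):
    # Probability matching: sample one plausible mean per arm from the
    # posterior and act greedily with respect to the sample. Each arm is
    # chosen with probability equal to the posterior belief that it is optimal.
    sampled = rng.beta(alpha, beta)
    action = int(np.argmax(sampled))
    reward = float(rng.random() < true_means[action])
    alpha[action] += reward
    beta[action] += 1.0 - reward
    rewards.append(reward)

print("average online reward:", np.mean(rewards))
print("posterior means:", alpha / (alpha + beta))
```

In this toy setting the agent keeps exploiting the well-covered arm while its residual uncertainty about the under-sampled arm still drives occasional exploration, so online performance neither collapses (as a purely optimistic switch might) nor stagnates (as a purely pessimistic policy might).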

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-hu24p,
  title     = {{B}ayesian Design Principles for Offline-to-Online Reinforcement Learning},
  author    = {Hu, Hao and Yang, Yiqin and Ye, Jianing and Wu, Chengjie and Mai, Ziqing and Hu, Yujing and Lv, Tangjie and Fan, Changjie and Zhao, Qianchuan and Zhang, Chongjie},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {19491--19515},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24p/hu24p.pdf},
  url       = {https://proceedings.mlr.press/v235/hu24p.html}
}
Endnote
%0 Conference Paper
%T Bayesian Design Principles for Offline-to-Online Reinforcement Learning
%A Hao Hu
%A Yiqin Yang
%A Jianing Ye
%A Chengjie Wu
%A Ziqing Mai
%A Yujing Hu
%A Tangjie Lv
%A Changjie Fan
%A Qianchuan Zhao
%A Chongjie Zhang
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-hu24p
%I PMLR
%P 19491--19515
%U https://proceedings.mlr.press/v235/hu24p.html
%V 235
APA
Hu, H., Yang, Y., Ye, J., Wu, C., Mai, Z., Hu, Y., Lv, T., Fan, C., Zhao, Q. & Zhang, C. (2024). Bayesian Design Principles for Offline-to-Online Reinforcement Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:19491-19515. Available from https://proceedings.mlr.press/v235/hu24p.html.
