Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis

Ruiquan Huang, Donghao Li, Chengshuai Shi, Cong Shen, Jing Yang
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, PMLR 286:1745-1767, 2025.

Abstract

This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis, and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap of $\tilde{O}( \sqrt{1/(N_0/\mathtt{C}(\pi^\star|\rho)+N_1)} )$, where $\mathtt{C}(\pi^\star|\rho)$ is a new concentrability coefficient and $N_0$ and $N_1$ are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant $\tilde{O}( \sqrt{N_1/(N_0/\mathtt{C}(\pi^{-}|\rho)+N_1)} )$ speed-up compared to pure online learning, where $\mathtt{C}(\pi^-|\rho)$ is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation in the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings through experiments on special RL models such as linear contextual bandits and Markov decision processes (MDPs).
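
To make the hybrid setup concrete, below is a minimal, self-contained sketch (not the paper's algorithm) of a confidence-based linear contextual bandit whose ridge-regression statistics are warm-started with $N_0$ offline samples before $N_1$ rounds of optimistic online play. All dimensions, hyperparameters, the noise level, and the uniform behavior policy are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only: a LinUCB-style confidence-based bandit whose
# statistics are warm-started with offline data, then run optimistically online.
# All constants below (d, K, lam, beta, N0, N1) are assumptions, not from the paper.

rng = np.random.default_rng(0)
d, K, lam, beta = 5, 10, 1.0, 1.0             # feature dim, arms, ridge param, bonus scale
theta_star = rng.normal(size=d) / np.sqrt(d)  # unknown reward parameter
arms = rng.normal(size=(K, d))                # fixed arm feature vectors

def reward(x):
    return x @ theta_star + 0.1 * rng.normal()

# Ridge-regression statistics shared by the offline and online phases.
A = lam * np.eye(d)   # Gram matrix
b = np.zeros(d)       # feature-weighted rewards

# Offline phase: N0 samples from a behavior distribution rho (uniform over arms here).
N0 = 200
for _ in range(N0):
    x = arms[rng.integers(K)]
    A += np.outer(x, x)
    b += reward(x) * x

# Online phase: N1 rounds of UCB arm selection on top of the offline-initialized statistics.
N1 = 300
regret = 0.0
best_value = max(x @ theta_star for x in arms)
for _ in range(N1):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    bonus = beta * np.sqrt(np.einsum('ki,ij,kj->k', arms, A_inv, arms))
    x = arms[np.argmax(arms @ theta_hat + bonus)]
    regret += best_value - x @ theta_star
    A += np.outer(x, x)
    b += reward(x) * x

print(f"cumulative regret after warm start with {N0} offline samples: {regret:.2f}")
```

Increasing N0 (with a behavior policy that covers the relevant arms) shrinks the initial confidence widths and should reduce online regret, loosely mirroring the $\tilde{O}( \sqrt{N_1/(N_0/\mathtt{C}(\pi^{-}|\rho)+N_1)} )$ speed-up described in the abstract.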

Cite this Paper


BibTeX
@InProceedings{pmlr-v286-huang25a,
  title     = {Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis},
  author    = {Huang, Ruiquan and Li, Donghao and Shi, Chengshuai and Shen, Cong and Yang, Jing},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  pages     = {1745--1767},
  year      = {2025},
  editor    = {Chiappa, Silvia and Magliacane, Sara},
  volume    = {286},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--25 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v286/main/assets/huang25a/huang25a.pdf},
  url       = {https://proceedings.mlr.press/v286/huang25a.html},
  abstract  = {This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis, and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap of $\tilde{O}( \sqrt{1/(N_0/\mathtt{C}(\pi^\star|\rho)+N_1)} )$, where $\mathtt{C}(\pi^\star|\rho)$ is a new concentrability coefficient and $N_0$ and $N_1$ are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant $\tilde{O}( \sqrt{N_1/(N_0/\mathtt{C}(\pi^{-}|\rho)+N_1)} )$ speed-up compared to pure online learning, where $\mathtt{C}(\pi^-|\rho)$ is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation in the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings through experiments on special RL models such as linear contextual bandits and Markov decision processes (MDPs).}
}
Endnote
%0 Conference Paper
%T Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis
%A Ruiquan Huang
%A Donghao Li
%A Chengshuai Shi
%A Cong Shen
%A Jing Yang
%B Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2025
%E Silvia Chiappa
%E Sara Magliacane
%F pmlr-v286-huang25a
%I PMLR
%P 1745--1767
%U https://proceedings.mlr.press/v286/huang25a.html
%V 286
%X This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy. We present a unified algorithm and analysis, and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone and achieves state-of-the-art results under two learning metrics, i.e., sub-optimality gap and online learning regret. Specifically, we show that our algorithm achieves a sub-optimality gap of $\tilde{O}( \sqrt{1/(N_0/\mathtt{C}(\pi^\star|\rho)+N_1)} )$, where $\mathtt{C}(\pi^\star|\rho)$ is a new concentrability coefficient and $N_0$ and $N_1$ are the numbers of offline and online samples, respectively. For regret minimization, we show that it achieves a constant $\tilde{O}( \sqrt{N_1/(N_0/\mathtt{C}(\pi^{-}|\rho)+N_1)} )$ speed-up compared to pure online learning, where $\mathtt{C}(\pi^-|\rho)$ is the concentrability coefficient over all sub-optimal policies. Our results also reveal an interesting separation in the desired coverage properties of the offline dataset for sub-optimality gap minimization and regret minimization. We further validate our theoretical findings through experiments on special RL models such as linear contextual bandits and Markov decision processes (MDPs).
APA
Huang, R., Li, D., Shi, C., Shen, C. & Yang, J. (2025). Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis. Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 286:1745-1767. Available from https://proceedings.mlr.press/v286/huang25a.html.
