Rates for Offline Reinforcement Learning with Adaptively Collected Data

Sunil Madhow, Dan Qiao, Ming Yin, Yu-Xiang Wang
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:259-271, 2025.

Abstract

Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Such results tend to hinge on unrealistic assumptions about the data distribution — namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We propose a relaxation of the i.i.d. setting that allows logging policies to depend adaptively upon previous data. For tabular MDPs, we show that minimax-optimal bounds on the sample complexity of offline policy evaluation (OPE) and offline policy learning (OPL) can be recovered under this adaptive setting, and also derive instance-dependent bounds. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive data. We find that, even while controlling for logging policies, adaptive data can change the signed behavior of estimation error.
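To make the setting concrete, below is a minimal, purely illustrative sketch (not the paper's estimators or experimental protocol) contrasting i.i.d. and adaptive data collection in a toy tabular MDP, with a simple plug-in OPE estimate of a fixed target policy. All problem sizes, the adaptive logging rule, and the MDP instance are made-up assumptions for illustration.

```python
# Illustrative sketch only: a toy tabular setup contrasting an i.i.d. (fixed
# uniform) logger with an adaptive logger whose action choices depend on
# previously collected data, plus a model-based (plug-in) OPE estimate.
# This is NOT the paper's method; all parameters below are assumptions.
import numpy as np

S, A, H, K = 3, 2, 5, 2000          # states, actions, horizon, episodes (toy sizes)
rng = np.random.default_rng(0)

# A fixed, randomly drawn tabular MDP (assumed toy instance).
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] -> next-state distribution
R = rng.uniform(0, 1, size=(H, S, A))           # deterministic mean rewards

pi_target = np.zeros((H, S), dtype=int)         # target policy: always play action 0

def true_value(P, R, pi):
    """Exact value of pi at the fixed initial state 0 via backward induction."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = pi[h]
        V = R[h, np.arange(S), a] + (P[h, np.arange(S), a] * V).sum(axis=1)
    return V[0]

def collect(adaptive):
    """Roll out K episodes; the adaptive logger prefers under-visited actions."""
    counts = np.zeros((H, S, A))
    reward_sum = np.zeros((H, S, A))
    next_counts = np.zeros((H, S, A, S))
    for _ in range(K):
        s = 0
        for h in range(H):
            if adaptive:
                a = int(np.argmin(counts[h, s]))   # choice depends on past data
            else:
                a = int(rng.integers(A))           # fixed uniform logging policy
            s_next = int(rng.choice(S, p=P[h, s, a]))
            counts[h, s, a] += 1
            reward_sum[h, s, a] += R[h, s, a]
            next_counts[h, s, a, s_next] += 1
            s = s_next
    return counts, reward_sum, next_counts

def plugin_ope(counts, reward_sum, next_counts, pi):
    """Plug-in OPE: evaluate pi in the empirical MDP built from the logged data."""
    n = np.maximum(counts, 1)
    P_hat = next_counts / n[..., None]
    R_hat = reward_sum / n
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = pi[h]
        V = R_hat[h, np.arange(S), a] + (P_hat[h, np.arange(S), a] * V).sum(axis=1)
    return V[0]

v_star = true_value(P, R, pi_target)
for adaptive in (False, True):
    v_hat = plugin_ope(*collect(adaptive), pi_target)
    print(f"adaptive={adaptive}:  estimate={v_hat:.4f}  error={v_hat - v_star:+.4f}")
```

Running the sketch with different seeds gives a rough feel for how the same estimator can behave differently on adaptively collected versus i.i.d. data, which is the kind of comparison the paper's simulations study in a principled way.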

Cite this Paper


BibTeX
@InProceedings{pmlr-v283-madhow25a,
  title     = {Rates for Offline Reinforcement Learning with Adaptively Collected Data},
  author    = {Madhow, Sunil and Qiao, Dan and Yin, Ming and Wang, Yu-Xiang},
  booktitle = {Proceedings of the 7th Annual Learning for Dynamics \& Control Conference},
  pages     = {259--271},
  year      = {2025},
  editor    = {Ozay, Necmiye and Balzano, Laura and Panagou, Dimitra and Abate, Alessandro},
  volume    = {283},
  series    = {Proceedings of Machine Learning Research},
  month     = {04--06 Jun},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v283/main/assets/madhow25a/madhow25a.pdf},
  url       = {https://proceedings.mlr.press/v283/madhow25a.html},
  abstract  = {Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Such results tend to hinge on unrealistic assumptions about the data distribution — namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We propose a relaxation of the i.i.d. setting that allows logging policies to depend adaptively upon previous data. For tabular MDPs, we show that minimax-optimal bounds on the sample complexity of offline policy evaluation (OPE) and offline policy learning (OPL) can be recovered under this adaptive setting, and also derive instance-dependent bounds. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive data. We find that, even while controlling for logging policies, adaptive data can change the signed behavior of estimation error.}
}
Endnote
%0 Conference Paper
%T Rates for Offline Reinforcement Learning with Adaptively Collected Data
%A Sunil Madhow
%A Dan Qiao
%A Ming Yin
%A Yu-Xiang Wang
%B Proceedings of the 7th Annual Learning for Dynamics & Control Conference
%C Proceedings of Machine Learning Research
%D 2025
%E Necmiye Ozay
%E Laura Balzano
%E Dimitra Panagou
%E Alessandro Abate
%F pmlr-v283-madhow25a
%I PMLR
%P 259--271
%U https://proceedings.mlr.press/v283/madhow25a.html
%V 283
%X Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Such results tend to hinge on unrealistic assumptions about the data distribution — namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We propose a relaxation of the i.i.d. setting that allows logging policies to depend adaptively upon previous data. For tabular MDPs, we show that minimax-optimal bounds on the sample complexity of offline policy evaluation (OPE) and offline policy learning (OPL) can be recovered under this adaptive setting, and also derive instance-dependent bounds. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive data. We find that, even while controlling for logging policies, adaptive data can change the signed behavior of estimation error.
APA
Madhow, S., Qiao, D., Yin, M. & Wang, Y. (2025). Rates for Offline Reinforcement Learning with Adaptively Collected Data. Proceedings of the 7th Annual Learning for Dynamics & Control Conference, in Proceedings of Machine Learning Research 283:259-271. Available from https://proceedings.mlr.press/v283/madhow25a.html.
