Rates for Offline Reinforcement Learning with Adaptively Collected Data
Proceedings of the 7th Annual Learning for Dynamics & Control Conference, PMLR 283:259-271, 2025.
Abstract
Developing theoretical guarantees on the sample complexity of offline RL methods is an important step towards making data-hungry RL algorithms practically viable. Such results tend to hinge on unrealistic assumptions about the data distribution — namely that it comprises a set of i.i.d. trajectories collected by a single logging policy. We propose a relaxation of the i.i.d. setting that allows logging policies to depend adaptively upon previous data. For tabular MDPs, we show that minimax-optimal bounds on the sample complexity of offline policy evaluation (OPE) and offline policy learning (OPL) can be recovered under this adaptive setting, and also derive instance-dependent bounds. Finally, we conduct simulations to empirically analyze the behavior of these estimators under adaptive and non-adaptive data. We find that, even while controlling for logging policies, adaptive data can change the signed behavior of estimation error.
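To make the adaptive-data setting concrete, the sketch below simulates episodic data collection in a small tabular MDP in which each episode's logging policy is chosen epsilon-greedily from the data gathered so far (so later logging policies depend on earlier trajectories), and then forms a simple plug-in (model-based) OPE estimate of a fixed target policy. This is an illustrative sketch only: the MDP sizes, the epsilon-greedy adaptation rule, and the plug-in estimator are assumptions of the example, not the estimators or experimental setup studied in the paper.

```python
# Minimal sketch (not the paper's code): adaptively collected episodic data in a
# tabular MDP, followed by a plug-in OPE estimate of a fixed target policy.
# All sizes and the adaptation rule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 4, 2, 5                              # states, actions, horizon (hypothetical)
P = rng.dirichlet(np.ones(S), size=(S, A))     # true transitions: P[s, a] is a dist over next states
R = rng.uniform(0, 1, size=(S, A))             # true mean rewards
target = np.full((S, A), 1.0 / A)              # target policy: uniform over actions

def run_episode(policy):
    """Roll out one episode of length H; policy is an (S, A) array of action probabilities."""
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(A, p=policy[s])
        r = R[s, a] + 0.1 * rng.standard_normal()
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

# Adaptive logging: each episode's policy is epsilon-greedy with respect to the
# running empirical reward means, so it depends on all previously collected data.
counts = np.zeros((S, A))
reward_sum = np.zeros((S, A))
trans_counts = np.zeros((S, A, S))
eps = 0.2
for _ in range(2000):
    means = np.where(counts > 0, reward_sum / np.maximum(counts, 1), 0.0)
    greedy = means.argmax(axis=1)
    logging = np.full((S, A), eps / A)
    logging[np.arange(S), greedy] += 1 - eps   # adaptivity: policy built from past episodes
    for (s, a, r, s_next) in run_episode(logging):
        counts[s, a] += 1
        reward_sum[s, a] += r
        trans_counts[s, a, s_next] += 1

# Plug-in OPE: evaluate the target policy in the empirical MDP fitted to the logged data.
P_hat = trans_counts / np.maximum(counts[..., None], 1)
P_hat[counts == 0] = 1.0 / S                   # unvisited (s, a) pairs: fall back to uniform
R_hat = reward_sum / np.maximum(counts, 1)
V = np.zeros(S)
for _ in range(H):                             # finite-horizon backward induction
    Q = R_hat + P_hat @ V
    V = (target * Q).sum(axis=1)
print("Plug-in estimate of the target policy's value from state 0:", V[0])
```

Rerunning the same loop with a fixed (non-adaptive) logging policy gives a natural point of comparison for how adaptivity affects the estimation error of the plug-in estimate.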