Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:6372-6405, 2026.

Abstract

We study fixed-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^\pi$-realizability) [Foster et al., 2022]. Recently, Tkachuk et al. [2024] gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions, with the additional requirement that the behavior policy is known. Further, we show that the sample complexity of the learner used by Tkachuk et al. [2024] for policy optimization can be improved by a tighter analysis.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-tkachuk26a,
  title = {Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability},
  author = {Tkachuk, Volodymyr and Szepesv\'ari, Csaba and Tan, Xiaoqi},
  booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory},
  pages = {6372--6405},
  year = {2026},
  editor = {Hanneke, Steve and Lattimore, Tor},
  volume = {336},
  series = {Proceedings of Machine Learning Research},
  month = {29 Jun--03 Jul},
  publisher = {PMLR},
  pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/tkachuk26a/tkachuk26a.pdf},
  url = {https://proceedings.mlr.press/v336/tkachuk26a.html},
  abstract = {We study fixed-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^\pi$-realizability) [Foster et al., 2022]. Recently, Tkachuk et al. [2024] gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions, with the additional requirement that the behavior policy is known. Further, we show that the sample complexity of the learner used by Tkachuk et al. [2024] for policy optimization can be improved by a tighter analysis.}
}
Endnote
%0 Conference Paper
%T Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability
%A Volodymyr Tkachuk
%A Csaba Szepesvári
%A Xiaoqi Tan
%B Proceedings of Thirty Ninth Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2026
%E Steve Hanneke
%E Tor Lattimore
%F pmlr-v336-tkachuk26a
%I PMLR
%P 6372--6405
%U https://proceedings.mlr.press/v336/tkachuk26a.html
%V 336
%X We study fixed-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^\pi$-realizability) [Foster et al., 2022]. Recently, Tkachuk et al. [2024] gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions, with the additional requirement that the behavior policy is known. Further, we show that the sample complexity of the learner used by Tkachuk et al. [2024] for policy optimization can be improved by a tighter analysis.
APA
Tkachuk, V., Szepesvári, C., & Tan, X. (2026). Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:6372-6405. Available from https://proceedings.mlr.press/v336/tkachuk26a.html.
