[edit]
A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak–Ruppert Averaging
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:4588-4634, 2026.
Abstract
We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain \emph{unprojected} TD(0) algorithm with Polyak–Ruppert (PR) averaging, using a \emph{single} stepsize schedule $\eta_t \propto 1/(\tau_{\mathrm{mix}}\log(t)\,\sqrt{t})$ that depends on mixing time but requires \emph{no prior knowledge of the curvature parameter $\omega$}. Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded \emph{with high probability}, without projections and without any stability argument based on $\omega$. Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free $\widetilde{\mathcal O}(\tau_{\mathrm{mix}}/\sqrt{T})$ rate and a fast curvature-dependent $\widetilde{\mathcal O}(\tau_{\mathrm{mix}}^2/(\omega T))$ rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.