A finite-sample analysis of multi-step temporal difference estimates
Proceedings of The 5th Annual Learning for Dynamics and Control Conference, PMLR 211:612-624, 2023.
Abstract
We consider the problem of estimating the value function of an infinite-horizon γ-discounted Markov reward process (MRP). We establish non-asymptotic guarantees for a general family of multi-step temporal difference (TD) estimates, including canonical K-step look-ahead TD for K=1,2,… and the TD(λ) family for λ∈[0,1) as special cases. Our bounds capture the dependence of these estimates both on the variance, as defined by Bellman fluctuations, and on the bias arising from possible model mis-specification. Our results reveal that the variance component shows limited sensitivity to the look-ahead that defines the estimator, whereas increasing the look-ahead can reduce the bias term. This highlights the benefit of using a larger look-ahead: it reduces bias but need not increase the variance.
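For concreteness, the estimators the abstract refers to can be sketched in a few lines of tabular Python. The sketch below is illustrative only, not the paper's code: the trajectory format (state, reward, next-state triples), the function names, and the step size α are assumptions made for the example; it shows the K-step look-ahead target and the TD(λ) target as a geometric mixture of K-step returns.

```python
import numpy as np

def k_step_return(trajectory, V, t, K, gamma):
    """K-step look-ahead return from time t: up to K discounted rewards,
    then a bootstrap from the current value estimate V."""
    T = len(trajectory)
    steps = min(K, T - t)  # truncate if the trajectory ends early
    G = sum(gamma ** i * trajectory[t + i][1] for i in range(steps))
    last_next_state = trajectory[t + steps - 1][2]
    return G + gamma ** steps * V[last_next_state]

def lambda_return(trajectory, V, t, lam, gamma):
    """TD(lambda) target at time t: (1 - lam) * lam^(K-1) mixture of the
    K-step returns, with the leftover mass on the longest available return."""
    K_max = len(trajectory) - t
    G = sum((1 - lam) * lam ** (K - 1) * k_step_return(trajectory, V, t, K, gamma)
            for K in range(1, K_max))
    G += lam ** (K_max - 1) * k_step_return(trajectory, V, t, K_max, gamma)
    return G

def td_pass(trajectory, V, gamma, alpha, K=None, lam=None):
    """One sweep of tabular TD updates toward either a K-step or a lambda target."""
    for t, (s, _, _) in enumerate(trajectory):
        target = (k_step_return(trajectory, V, t, K, gamma) if K is not None
                  else lambda_return(trajectory, V, t, lam, gamma))
        V[s] += alpha * (target - V[s])
    return V
```

Under this sketch, K=1 (or λ=0) recovers one-step TD, while larger K (or λ closer to 1) places more weight on observed rewards and less on the bootstrapped value, which is the bias-variance trade-off the paper's bounds quantify.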