Loosely consistent emphatic temporal-difference learning
Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:849-859, 2023.
Abstract
There has been significant interest in searching for off-policy Temporal-Difference (TD) algorithms that find the same solution that would have been obtained in the on-policy regime. An important property of such algorithms is that their expected update has the same fixed point as that of On-policy TD(λ), which we call loose consistency. Notably, Full-IS-TD(λ) is the only existing loosely consistent method under general linear function approximation but, unfortunately, suffers from high variance and is scarcely practical. This notorious high-variance issue motivates the introduction of ETD(λ), which tames the variance but has a biased fixed point. Inspired by these two methods, we propose a new loosely consistent algorithm called Average Emphatic TD (AETD(λ)) with a transient bias, which strikes a balance between bias and variance. Further, we unify AETD(λ) with existing methods to obtain a new family of loosely consistent algorithms called Loosely Consistent Emphatic TD (LC-ETD(λ, β, ν)), which can control a smooth bias-variance trade-off by varying the speed at which the transient bias fades. Through experiments on illustrative examples, we show the effectiveness and practicality of LC-ETD(λ, β, ν).
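
For context, the sketch below shows one step of the standard ETD(λ) update for off-policy policy evaluation with linear function approximation (Sutton, Mahmood & White, 2016), the baseline emphatic method that the abstract's LC-ETD(λ, β, ν) family generalizes. It does not reproduce the paper's loosely consistent variants, and all function and variable names are illustrative assumptions.

import numpy as np

def etd_lambda_step(w, e, F, phi, phi_next, reward, rho, rho_prev,
                    gamma, lam, interest, alpha):
    """One step of standard ETD(lambda) with linear function approximation.
    Illustrative sketch only; LC-ETD(lambda, beta, nu) is described in the paper.

    w        -- weight vector, shape (d,)
    e        -- eligibility trace, shape (d,)
    F        -- follow-on trace (scalar)
    phi      -- features of the current state, shape (d,)
    phi_next -- features of the next state, shape (d,)
    rho      -- importance-sampling ratio pi(a|s)/mu(a|s) at this step
    rho_prev -- importance-sampling ratio from the previous step
    interest -- interest i(s) assigned to the current state
    """
    # Follow-on trace: discounted, importance-corrected accumulation of interest.
    F = gamma * rho_prev * F + interest
    # Emphasis blends the instantaneous interest with the follow-on trace.
    M = lam * interest + (1.0 - lam) * F
    # Emphatically weighted, importance-corrected eligibility trace.
    e = rho * (gamma * lam * e + M * phi)
    # Standard TD error under linear function approximation.
    delta = reward + gamma * w.dot(phi_next) - w.dot(phi)
    # Semi-gradient update along the emphatic trace.
    w = w + alpha * delta * e
    return w, e, F

Per the abstract, the LC-ETD(λ, β, ν) family modifies this kind of emphatic weighting so that the bias is only transient, with β and ν controlling how quickly that transient bias fades and hence the bias-variance trade-off.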