Preferential Temporal Difference Learning

Nishanth V. Anand, Doina Precup
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:286-296, 2021.

Abstract

Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
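The re-weighting idea in the abstract can be illustrated with a small tabular sketch: a preference `beta[s]` in [0, 1] scales both how strongly a state is updated when visited (its role as an *input*) and how much its value estimate is trusted when it serves as a bootstrap *target*. This is a hedged toy interpretation in the spirit of the abstract, not the paper's exact Preferential TD update rule; the function name and weighting scheme here are illustrative assumptions.

```python
def state_weighted_td_update(V, s, s_next, r, beta, alpha=0.1, gamma=0.99):
    """One state-weighted TD(0)-style update on a tabular value function V.

    beta[s] down-weights updates to states we care less about (input role);
    beta[s_next] shrinks the bootstrap term when the successor's estimate is
    unreliable (target role). Illustrative scheme only -- see the paper for
    the actual Preferential TD update and its convergence analysis.
    """
    # Trust the successor's value only in proportion to its preference.
    td_error = r + gamma * beta[s_next] * V[s_next] - V[s]
    # Update the visited state in proportion to its own preference.
    V[s] += alpha * beta[s] * td_error
    return V
```

With `beta[s] = 1` everywhere this reduces to ordinary tabular TD(0); setting `beta[s] = 0` for a state freezes its estimate and removes it from bootstrap targets, which is one way to read "less desirable as targets" for partially observable or data-poor states.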

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-anand21a,
  title     = {Preferential Temporal Difference Learning},
  author    = {Anand, Nishanth V and Precup, Doina},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {286--296},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/anand21a/anand21a.pdf},
  url       = {https://proceedings.mlr.press/v139/anand21a.html},
  abstract  = {Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.}
}
Endnote
%0 Conference Paper
%T Preferential Temporal Difference Learning
%A Nishanth V. Anand
%A Doina Precup
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-anand21a
%I PMLR
%P 286--296
%U https://proceedings.mlr.press/v139/anand21a.html
%V 139
%X Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
APA
Anand, N.V. & Precup, D. (2021). Preferential Temporal Difference Learning. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:286-296. Available from https://proceedings.mlr.press/v139/anand21a.html.