Adaptive Trade-Offs in Off-Policy Learning

Mark Rowland, Will Dabney, Remi Munos
Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:34-44, 2020.

Abstract

A great variety of off-policy learning algorithms exist in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate. This leads to new perspectives on existing methods, and also naturally yields novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it is able to more efficiently make these trade-offs than existing methods in use, and that it can be scaled to yield state-of-the-art performance in large-scale environments.
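The trade-off named in the abstract (update variance versus fixed-point bias versus contraction rate) is commonly controlled in return-based off-policy methods such as Retrace and V-trace by how aggressively importance-sampling ratios are truncated. The sketch below is a minimal illustration of that mechanism, assuming a V-trace-style value target with a single truncation level `clip`; the function names are hypothetical, and this is not the paper's C-trace algorithm.

import numpy as np

def clipped_ratios(pi_probs, mu_probs, clip):
    # Truncated importance-sampling ratios min(clip, pi/mu).
    # Lower `clip` reduces update variance but adds fixed-point bias and
    # slows contraction; higher `clip` does the reverse.
    return np.minimum(clip, np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float))

def corrected_value_target(rewards, values, pi_probs, mu_probs, gamma=0.99, clip=1.0):
    # V-trace-style target (single truncation level, for illustration only):
    #   v = V(x_0) + sum_t gamma^t (c_0 ... c_{t-1}) * rho_t * delta_t,
    #   delta_t = r_t + gamma V(x_{t+1}) - V(x_t),
    # with rho_t and c_t both equal to the truncated ratio.
    rho = clipped_ratios(pi_probs, mu_probs, clip)
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)   # length T+1: V(x_0), ..., V(x_T)
    deltas = rewards + gamma * values[1:] - values[:-1]
    target, prod_c = values[0], 1.0
    for t in range(len(deltas)):
        target += (gamma ** t) * prod_c * rho[t] * deltas[t]
        prod_c *= rho[t]
    return target

# Example usage with a 3-step trajectory:
# corrected_value_target(rewards=[1.0, 0.0, 1.0], values=[0.5, 0.4, 0.6, 0.3],
#                        pi_probs=[0.9, 0.2, 0.7], mu_probs=[0.5, 0.5, 0.5], clip=1.0)

In this kind of scheme a smaller truncation level gives lower-variance but more biased and more slowly contracting updates; the paper's unifying view and its C-trace algorithm concern making such trade-offs more efficiently.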

Cite this Paper


BibTeX
@InProceedings{pmlr-v108-rowland20a,
  title     = {Adaptive Trade-Offs in Off-Policy Learning},
  author    = {Rowland, Mark and Dabney, Will and Munos, Remi},
  booktitle = {Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics},
  pages     = {34--44},
  year      = {2020},
  editor    = {Chiappa, Silvia and Calandra, Roberto},
  volume    = {108},
  series    = {Proceedings of Machine Learning Research},
  month     = {26--28 Aug},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v108/rowland20a/rowland20a.pdf},
  url       = {https://proceedings.mlr.press/v108/rowland20a.html},
  abstract  = {A great variety of off-policy learning algorithms exist in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate. This leads to new perspectives on existing methods, and also naturally yields novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it is able to more efficiently make these trade-offs than existing methods in use, and that it can be scaled to yield state-of-the-art performance in large-scale environments.}
}
Endnote
%0 Conference Paper
%T Adaptive Trade-Offs in Off-Policy Learning
%A Mark Rowland
%A Will Dabney
%A Remi Munos
%B Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2020
%E Silvia Chiappa
%E Roberto Calandra
%F pmlr-v108-rowland20a
%I PMLR
%P 34--44
%U https://proceedings.mlr.press/v108/rowland20a.html
%V 108
%X A great variety of off-policy learning algorithms exist in the literature, and new breakthroughs in this area continue to be made, improving theoretical understanding and yielding state-of-the-art reinforcement learning algorithms. In this paper, we take a unifying view of this space of algorithms, and consider their trade-offs of three fundamental quantities: update variance, fixed-point bias, and contraction rate. This leads to new perspectives on existing methods, and also naturally yields novel algorithms for off-policy evaluation and control. We develop one such algorithm, C-trace, demonstrating that it is able to more efficiently make these trade-offs than existing methods in use, and that it can be scaled to yield state-of-the-art performance in large-scale environments.
APA
Rowland, M., Dabney, W. & Munos, R. (2020). Adaptive Trade-Offs in Off-Policy Learning. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 108:34-44. Available from https://proceedings.mlr.press/v108/rowland20a.html.