Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

Stephen Tu, Benjamin Recht
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5005-5014, 2018.

Abstract

Reinforcement learning (RL) has been successfully used to solve many continuous control tasks. Despite its impressive results, however, fundamental questions regarding the sample complexity of RL on continuous problems remain open. We study the performance of RL in this setting by considering the behavior of the Least-Squares Temporal Difference (LSTD) estimator on the classic Linear Quadratic Regulator (LQR) problem from optimal control. We give the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within epsilon-relative error. In the process of deriving our result, we give a general characterization for when the minimum eigenvalue of the empirical covariance matrix formed along the sample path of a fast-mixing stochastic process concentrates above zero, extending a result by Koltchinskii and Mendelson in the independent covariates setting. Finally, we provide experimental evidence indicating that our analysis correctly captures the qualitative behavior of LSTD on several LQR instances.
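As a rough illustration of the setting described above, the sketch below runs LSTD with quadratic features on a small hypothetical LQR instance under a fixed static state-feedback policy, and compares the recovered value-function matrix against the one obtained from the closed-loop Lyapunov equation. All system matrices, the policy K, and the discount factor are invented for illustration and are not taken from the paper; this is a minimal sketch of the LSTD estimator, not the authors' experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 1-input LQR instance (illustrative; not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[-0.1, -0.5]])   # fixed static state-feedback policy u = K x
gamma = 0.9                    # discount factor (illustrative choice)
sigma_w = 0.1                  # process-noise standard deviation

Acl = A + B @ K                # closed-loop dynamics: x_{t+1} = Acl x_t + w_t

# Ground-truth quadratic value function V(x) = x' P x + const, where P solves
# the discounted Lyapunov equation P = Q + K'RK + gamma * Acl' P Acl
# (computed here by fixed-point iteration).
P = np.zeros((2, 2))
for _ in range(2000):
    P = Q + K.T @ R @ K + gamma * Acl.T @ P @ Acl

def phi(x):
    # Quadratic features spanning x' P x, plus a bias term that absorbs
    # the constant offset induced by the process noise.
    return np.array([x[0] ** 2, x[1] ** 2, 2.0 * x[0] * x[1], 1.0])

# Collect one trajectory under the fixed policy and form the LSTD system
#   (sum_t phi_t (phi_t - gamma * phi_{t+1})') w = sum_t phi_t c_t.
T = 50_000
x = np.zeros(2)
M = np.zeros((4, 4))
b = np.zeros(4)
for _ in range(T):
    u = K @ x
    cost = x @ Q @ x + u @ R @ u
    x_next = Acl @ x + sigma_w * rng.standard_normal(2)
    f, f_next = phi(x), phi(x_next)
    M += np.outer(f, f - gamma * f_next)
    b += f * cost
    x = x_next

w = np.linalg.solve(M, b)
P_hat = np.array([[w[0], w[2]], [w[2], w[1]]])   # recover the quadratic part

rel_err = np.linalg.norm(P_hat - P) / np.linalg.norm(P)
print("relative error of LSTD estimate:", rel_err)
```

Because the quadratic-plus-bias feature class contains the true value function exactly, the LSTD estimate is consistent here, and the relative error shrinks as the trajectory length grows, in line with the finite-time behavior the paper analyzes.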

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-tu18a,
  title = {Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator},
  author = {Tu, Stephen and Recht, Benjamin},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages = {5005--5014},
  year = {2018},
  editor = {Dy, Jennifer and Krause, Andreas},
  volume = {80},
  series = {Proceedings of Machine Learning Research},
  month = {10--15 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v80/tu18a/tu18a.pdf},
  url = {https://proceedings.mlr.press/v80/tu18a.html},
  abstract = {Reinforcement learning (RL) has been successfully used to solve many continuous control tasks. Despite its impressive results however, fundamental questions regarding the sample complexity of RL on continuous problems remain open. We study the performance of RL in this setting by considering the behavior of the Least-Squares Temporal Difference (LSTD) estimator on the classic Linear Quadratic Regulator (LQR) problem from optimal control. We give the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within epsilon-relative error. In the process of deriving our result, we give a general characterization for when the minimum eigenvalue of the empirical covariance matrix formed along the sample path of a fast-mixing stochastic process concentrates above zero, extending a result by Koltchinskii and Mendelson in the independent covariates setting. Finally, we provide experimental evidence indicating that our analysis correctly captures the qualitative behavior of LSTD on several LQR instances.}
}
EndNote
%0 Conference Paper
%T Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator
%A Stephen Tu
%A Benjamin Recht
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-tu18a
%I PMLR
%P 5005--5014
%U https://proceedings.mlr.press/v80/tu18a.html
%V 80
%X Reinforcement learning (RL) has been successfully used to solve many continuous control tasks. Despite its impressive results however, fundamental questions regarding the sample complexity of RL on continuous problems remain open. We study the performance of RL in this setting by considering the behavior of the Least-Squares Temporal Difference (LSTD) estimator on the classic Linear Quadratic Regulator (LQR) problem from optimal control. We give the first finite-time analysis of the number of samples needed to estimate the value function for a fixed static state-feedback policy to within epsilon-relative error. In the process of deriving our result, we give a general characterization for when the minimum eigenvalue of the empirical covariance matrix formed along the sample path of a fast-mixing stochastic process concentrates above zero, extending a result by Koltchinskii and Mendelson in the independent covariates setting. Finally, we provide experimental evidence indicating that our analysis correctly captures the qualitative behavior of LSTD on several LQR instances.
APA
Tu, S., & Recht, B. (2018). Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5005-5014. Available from https://proceedings.mlr.press/v80/tu18a.html.