A Continuous-Time View of Early Stopping for Least Squares Regression

Alnur Ali, J. Zico Kolter, Ryan J. Tibshirani
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:1370-1378, 2019.

Abstract

We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$—where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression—we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.
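
To make the comparison concrete, here is a minimal numerical sketch (not the authors' code) in Python. It assumes the least squares loss $(1/(2n))\|y - X\beta\|_2^2$, zero initialization for gradient flow, and a ridge penalty $\lambda\|\beta\|_2^2$, so that gradient flow at time $t$ has the closed form $S^{+}(I - e^{-tS})X^\top y/n$ with $S = X^\top X/n$, and is compared against ridge at $\lambda = 1/t$; the squared distance to the true signal on a single simulated draw is used as a crude stand-in for risk.

import numpy as np
from scipy.linalg import expm, pinv

# Illustrative sketch only: the loss scaling and penalty form are assumptions,
# not taken verbatim from the paper.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + rng.standard_normal(n)

S = X.T @ X / n          # sample covariance of the features
b = X.T @ y / n

def gradient_flow(t):
    # closed-form gradient flow iterate: S^+ (I - exp(-t S)) X^T y / n
    return pinv(S) @ (np.eye(p) - expm(-t * S)) @ b

def ridge(lam):
    # ridge estimator: (S + lam I)^{-1} X^T y / n
    return np.linalg.solve(S + lam * np.eye(p), b)

for t in [0.1, 1.0, 10.0, 100.0]:
    err_gf = np.sum((gradient_flow(t) - beta0) ** 2)
    err_ridge = np.sum((ridge(1.0 / t) - beta0) ** 2)
    print(f"t = {t:6.1f}   gradient flow: {err_gf:.3f}   ridge(1/t): {err_ridge:.3f}")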

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-ali19a,
  title     = {A Continuous-Time View of Early Stopping for Least Squares Regression},
  author    = {Ali, Alnur and Kolter, J. Zico and Tibshirani, Ryan J.},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {1370--1378},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/ali19a/ali19a.pdf},
  url       = {https://proceedings.mlr.press/v89/ali19a.html},
  abstract  = {We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$—where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression—we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.}
}
Endnote
%0 Conference Paper
%T A Continuous-Time View of Early Stopping for Least Squares Regression
%A Alnur Ali
%A J. Zico Kolter
%A Ryan J. Tibshirani
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-ali19a
%I PMLR
%P 1370--1378
%U https://proceedings.mlr.press/v89/ali19a.html
%V 89
%X We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$—where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression—we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.
APA
Ali, A., Kolter, J.Z. & Tibshirani, R.J. (2019). A Continuous-Time View of Early Stopping for Least Squares Regression. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:1370-1378. Available from https://proceedings.mlr.press/v89/ali19a.html.
