A Continuous-Time View of Early Stopping for Least Squares Regression

Alnur Ali; J. Zico Kolter; Ryan J. Tibshirani

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:1370-1378, 2019.

Abstract

We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$ (where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression), we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.
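The relative risk bound under the calibration $t=1/\lambda$ can be checked numerically with a small sketch. The code below is not taken from the paper. It assumes a fixed-design linear model $y = X\beta_0 + \epsilon$ with noise covariance $\sigma^2 I$, gradient flow started at zero on the loss $\|y - X\beta\|_2^2/(2n)$, and ridge normalized so that its solution is $(X^\top X/n + \lambda I)^{-1} X^\top y/n$. Under these assumptions both estimation risks decompose over the eigenvalues $s_i$ of $X^\top X/n$. The spectrum `eigs`, the squared signal components `signal_sq`, and all function names are hypothetical and chosen only for illustration.

```python
import math

def gradient_flow_risk(eigs, signal_sq, t, sigma2, n):
    """Estimation risk of gradient flow at time t, summed over eigen-directions.

    Assumed per-direction form: squared bias b^2 * exp(-2 t s) plus
    variance (sigma2 / n) * (1 - exp(-t s))^2 / s.
    """
    risk = 0.0
    for s, b2 in zip(eigs, signal_sq):
        bias = b2 * math.exp(-2.0 * t * s)
        # The variance term tends to 0 as s -> 0, so a zero eigenvalue contributes none.
        var = (sigma2 / n) * (1.0 - math.exp(-t * s)) ** 2 / s if s > 0 else 0.0
        risk += bias + var
    return risk

def ridge_risk(eigs, signal_sq, lam, sigma2, n):
    """Estimation risk of ridge at tuning parameter lam > 0, same decomposition."""
    risk = 0.0
    for s, b2 in zip(eigs, signal_sq):
        bias = b2 * lam ** 2 / (s + lam) ** 2
        var = (sigma2 / n) * s / (s + lam) ** 2
        risk += bias + var
    return risk

def max_relative_risk(eigs, signal_sq, sigma2, n, lambdas):
    """Largest ratio of gradient flow risk to ridge risk over a grid, with t = 1 / lam."""
    return max(
        gradient_flow_risk(eigs, signal_sq, 1.0 / lam, sigma2, n)
        / ridge_risk(eigs, signal_sq, lam, sigma2, n)
        for lam in lambdas
    )

# Hypothetical spectrum and signal, used only to exercise the formulas.
eigs = [4.0, 1.0, 0.25, 0.01]
signal_sq = [1.0, 0.5, 0.25, 0.1]
lambdas = [10.0 ** (k / 4.0) for k in range(-16, 17)]

worst = max_relative_risk(eigs, signal_sq, sigma2=1.0, n=50, lambdas=lambdas)
print(f"largest risk ratio on the grid: {worst:.3f}")
```

On any such grid the printed ratio should stay below 1.69 if the assumed risk expressions match the setting of the abstract. A pure-noise case (all entries of `signal_sq` set to zero) is where the ratio gets closest to the bound, since the bias term of gradient flow is never larger than that of ridge under this calibration.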

Cite this Paper

BibTeX

@InProceedings{pmlr-v89-ali19a,
  title     = {A Continuous-Time View of Early Stopping for Least Squares Regression},
  author    = {Ali, Alnur and Kolter, J. Zico and Tibshirani, Ryan J.},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {1370--1378},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/ali19a/ali19a.pdf},
  url       = {https://proceedings.mlr.press/v89/ali19a.html},
  abstract  = {We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$ (where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression), we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.}
}

Endnote

%0 Conference Paper
%T A Continuous-Time View of Early Stopping for Least Squares Regression
%A Alnur Ali
%A J. Zico Kolter
%A Ryan J. Tibshirani
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-ali19a
%I PMLR
%P 1370--1378
%U https://proceedings.mlr.press/v89/ali19a.html
%V 89
%X We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t=1/\lambda$ (where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression), we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples with very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.

APA

Ali, A., Kolter, J. Z., & Tibshirani, R. J. (2019). A Continuous-Time View of Early Stopping for Least Squares Regression. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:1370-1378. Available from https://proceedings.mlr.press/v89/ali19a.html.
