Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:5019-5073, 2024.
Abstract
We consider \emph{gradient descent} (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in $O(\eta)$ steps, and subsequently achieves an $\tilde{O}(1/(\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an \emph{accelerated} loss of $\tilde{O}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{O}(1/T^2)$ acceleration), nonlinear predictors in the \emph{neural tangent kernel} regime, and online \emph{stochastic gradient descent} (SGD) with a large stepsize, under suitable separability conditions.
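To make the setting concrete, here is a minimal sketch (not taken from the paper) of constant large-stepsize GD on the logistic loss with linearly separable synthetic data. The dataset, the margin threshold, the constant $0.05$ in the stepsize, and the budget $T$ are illustrative assumptions, chosen only so that $\eta = \Theta(T)$ and the early loss values are visibly non-monotone.

```python
# Illustrative sketch: constant large-stepsize GD on logistic loss with
# linearly separable data. All numerical choices below are assumptions for
# demonstration, not values from the paper.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

# Linearly separable toy data: keep only points with margin > 0.3 w.r.t. w_star.
n, d = 200, 2
w_star = np.array([1.0, 0.5])
X = rng.normal(size=(n, d))
keep = np.abs(X @ w_star) > 0.3
X, y = X[keep], np.sign(X[keep] @ w_star)

def loss(w):
    # Logistic loss: mean_i log(1 + exp(-y_i <x_i, w>)), computed stably.
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def grad(w):
    # Gradient of the logistic loss at w.
    return X.T @ (-y * expit(-y * (X @ w))) / len(y)

T = 1000          # step budget (illustrative)
eta = 0.05 * T    # aggressive constant stepsize, eta = Theta(T)
w = np.zeros(d)
losses = []
for t in range(T):
    losses.append(loss(w))
    w = w - eta * grad(w)

# The loss is non-monotone in the early phase and then decreases;
# compare the first few iterates with the final loss.
print("first 5 losses:", np.round(losses[:5], 3))
print("final loss:", losses[-1])
```

With a small stepsize the recorded losses would decrease monotonically but slowly; with the aggressive $\eta$ above, the first iterates overshoot and the loss oscillates before the later, faster decrease that the paper analyzes.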