Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:5019-5073, 2024.
Abstract
We consider \emph{gradient descent} (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in $O(\eta)$ steps, and subsequently achieves an $\tilde{O}(1/(\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an \emph{accelerated} loss of $\tilde{O}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{O}(1/T^2)$ acceleration), nonlinear predictors in the \emph{neural tangent kernel} regime, and online \emph{stochastic gradient descent} (SGD) with a large stepsize, under suitable separability conditions.
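To make the setting concrete, here is a minimal sketch (not taken from the paper) of constant large-stepsize GD on the logistic loss with linearly separable synthetic data. The dataset, the margin threshold, the constant $0.05$ in the stepsize, and the budget $T$ are illustrative assumptions, chosen only so that $\eta = \Theta(T)$ and the early loss values are visibly non-monotone.

```python
# Illustrative sketch: constant large-stepsize GD on logistic loss with
# linearly separable data. All numerical choices below are assumptions for
# demonstration, not values from the paper.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(0)

# Linearly separable toy data: keep only points with margin > 0.3 w.r.t. w_star.
n, d = 200, 2
w_star = np.array([1.0, 0.5])
X = rng.normal(size=(n, d))
keep = np.abs(X @ w_star) > 0.3
X, y = X[keep], np.sign(X[keep] @ w_star)

def loss(w):
    # Logistic loss: mean_i log(1 + exp(-y_i <x_i, w>)), computed stably.
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def grad(w):
    # Gradient of the logistic loss at w.
    return X.T @ (-y * expit(-y * (X @ w))) / len(y)

T = 1000          # step budget (illustrative)
eta = 0.05 * T    # aggressive constant stepsize, eta = Theta(T)
w = np.zeros(d)
losses = []
for t in range(T):
    losses.append(loss(w))
    w = w - eta * grad(w)

# The loss is non-monotone in the early phase and then decreases;
# compare the first few iterates with the final loss.
print("first 5 losses:", np.round(losses[:5], 3))
print("final loss:", losses[-1])
```

With a small stepsize the recorded losses would decrease monotonically but slowly; with the aggressive $\eta$ above, the first iterates overshoot and the loss oscillates before the later, faster decrease that the paper analyzes.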