Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension

Michael Crawshaw, Mingrui Liu
Proceedings of Thirty Ninth Conference on Learning Theory, PMLR 336:1575-1610, 2026.

Abstract

We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large stepsize $\eta = \Theta(\gamma^2 T)$ (where $\gamma$ is the dataset’s margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta \gamma^2 T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $\tau$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of $\tau$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.

Cite this Paper


BibTeX
@InProceedings{pmlr-v336-crawshaw26a, title = {Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension}, author = {Crawshaw, Michael and Liu, Mingrui}, booktitle = {Proceedings of Thirty Ninth Conference on Learning Theory}, pages = {1575--1610}, year = {2026}, editor = {Hanneke, Steve and Lattimore, Tor}, volume = {336}, series = {Proceedings of Machine Learning Research}, month = {29 Jun--03 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v336/main/assets/crawshaw26a/crawshaw26a.pdf}, url = {https://proceedings.mlr.press/v336/crawshaw26a.html}, abstract = {We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large stepsize $\eta = \Theta(\gamma^2 T)$ (where $\gamma$ is the dataset’s margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta \gamma^2 T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $\tau$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of $\tau$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.} }
Endnote
%0 Conference Paper %T Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension %A Michael Crawshaw %A Mingrui Liu %B Proceedings of Thirty Ninth Conference on Learning Theory %C Proceedings of Machine Learning Research %D 2026 %E Steve Hanneke %E Tor Lattimore %F pmlr-v336-crawshaw26a %I PMLR %P 1575--1610 %U https://proceedings.mlr.press/v336/crawshaw26a.html %V 336 %X We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large stepsize $\eta = \Theta(\gamma^2 T)$ (where $\gamma$ is the dataset’s margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta \gamma^2 T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $\tau$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of $\tau$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.
APA
Crawshaw, M. & Liu, M.. (2026). Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension. Proceedings of Thirty Ninth Conference on Learning Theory, in Proceedings of Machine Learning Research 336:1575-1610 Available from https://proceedings.mlr.press/v336/crawshaw26a.html.

Related Material