The implicit bias of gradient descent on nonseparable data
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:1772-1798, 2019.
Abstract
Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}\left(\frac{\ln\ln t}{\ln t}\right)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}\left(\frac{(\ln t)^2}{\sqrt{t}}\right)$.
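The behavior described in the abstract can be illustrated with a small sketch. The three-point dataset, the step size, and all function names below are assumptions chosen for illustration and do not come from the paper. Each point is written as z_i = y_i x_i with the label folded in. The first point is linearly separable on its own (along the first axis), while the other two pull in opposite directions along the second axis, so the risk restricted to them has a bounded minimizer.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Hypothetical toy data z_i = y_i * x_i (labels folded into the features).
# (1, 0) is separable; (0, 1) and (0, -2) conflict along the second axis.
Z = [(1.0, 0.0), (0.0, 1.0), (0.0, -2.0)]

def risk_gradient(w):
    # Gradient of the empirical logistic risk
    #   R(w) = (1/n) * sum_i ln(1 + exp(-<w, z_i>)).
    g = [0.0, 0.0]
    for z in Z:
        margin = w[0] * z[0] + w[1] * z[1]
        coeff = -sigmoid(-margin) / len(Z)
        g[0] += coeff * z[0]
        g[1] += coeff * z[1]
    return g

def gradient_descent(steps, step_size=1.0):
    # Plain gradient descent started at the origin (assumed step size).
    w = [0.0, 0.0]
    for _ in range(steps):
        g = risk_gradient(w)
        w = [w[0] - step_size * g[0], w[1] - step_size * g[1]]
    return w

w_short = gradient_descent(100)
w_long = gradient_descent(10_000)
print("after 100 steps:   ", w_short)
print("after 10000 steps: ", w_long)
```

In this sketch the first coordinate keeps growing, roughly logarithmically in the number of steps, while the second coordinate settles at a finite negative value. The iterates therefore drift along a ray whose direction is the first axis and whose offset from the origin is nonzero, which is the qualitative picture the abstract describes; the example makes no attempt to verify the stated rates.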