The implicit bias of gradient descent on nonseparable data
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:1772-1798, 2019.
Abstract
Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}\big(\ln\ln t / \ln t\big)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}\big((\ln t)^2 / \sqrt{t}\big)$.
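As a rough numerical illustration of the phenomenon described above (not code or experiments from the paper), the sketch below runs full-batch gradient descent on a small, hypothetical nonseparable 2D logistic-regression problem: six points are linearly separable, while two identical inputs with opposite labels make the full dataset nonseparable. The printout shows the iterate norm continuing to grow while the normalized direction stabilizes, consistent with the iterates following a fixed ray.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

# Toy nonseparable dataset (illustrative only, not from the paper):
# the first six points are linearly separable; the last two share the same
# input but carry opposite labels, so no linear predictor separates all eight.
X = np.array([
    [ 2.0,  0.5], [ 3.0, -0.2], [ 2.5,  1.0],   # positive, separable part
    [-2.0,  0.3], [-3.0, -0.5], [-2.5,  0.8],   # negative, separable part
    [ 0.5,  1.0], [ 0.5,  1.0],                  # identical inputs, opposite labels
])
y = np.array([1, 1, 1, -1, -1, -1, 1, -1], dtype=float)

def risk_grad(w):
    # Gradient of the average logistic loss: mean over i of -y_i x_i * sigmoid(-y_i <w, x_i>).
    margins = y * (X @ w)
    return -(X * (y * expit(-margins))[:, None]).mean(axis=0)

w = np.zeros(2)
eta = 0.5  # small enough for the smoothness of this toy problem
for t in range(1, 200001):
    w -= eta * risk_grad(w)
    if t in (10, 100, 1000, 10000, 100000, 200000):
        norm = np.linalg.norm(w)
        print(f"t={t:>6d}  ||w||={norm:7.3f}  direction={w / norm}")

# Expected behavior on this toy problem: ||w|| keeps growing (roughly like ln t)
# while the printed direction settles down, i.e. the iterates track a ray whose
# direction is determined by the separable portion of the data.
```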