The implicit bias of gradient descent on nonseparable data

Ziwei Ji, Matus Telgarsky
Proceedings of the Thirty-Second Conference on Learning Theory, PMLR 99:1772-1798, 2019.

Abstract

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
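
To make the statement concrete, here is a minimal numerical sketch (not from the paper) of the phenomenon: plain gradient descent on the empirical logistic risk over a tiny nonseparable dataset, where two identical points with opposite labels form the nonseparable part. The dataset, step size, and printing schedule are illustrative assumptions; the run simply shows the iterate norm growing without bound (roughly like $\ln t$) while the normalized iterate settles onto a fixed direction.

```python
import numpy as np

# Toy 2D dataset (hypothetical, not from the paper): the first two points are
# positive and linearly separable along the first coordinate, while the last
# two share the same features with opposite labels, making the full set
# nonseparable.
X = np.array([
    [3.0, 0.0],
    [4.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
])
y = np.array([1.0, 1.0, 1.0, -1.0])

def risk_grad(w):
    # Gradient of the empirical logistic risk (1/n) * sum_i ln(1 + exp(-y_i <w, x_i>)).
    margins = y * (X @ w)
    # Per-example coefficient: -y_i / (1 + exp(y_i <w, x_i>)); multiplied by x_i below.
    coeffs = -y / (1.0 + np.exp(margins))
    return (coeffs[:, None] * X).mean(axis=0)

w = np.zeros(2)
eta = 0.5  # a step size below 1/smoothness for this toy dataset
for t in range(1, 100_001):
    w -= eta * risk_grad(w)
    if t in (10, 100, 1_000, 10_000, 100_000):
        norm = np.linalg.norm(w)
        # The norm keeps growing (roughly like ln t) while the direction stabilizes.
        print(f"t={t:>6}  ||w_t||={norm:7.3f}  w_t/||w_t||={w / norm}")
```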

Cite this Paper


BibTeX
@InProceedings{pmlr-v99-ji19a,
  title     = {The implicit bias of gradient descent on nonseparable data},
  author    = {Ji, Ziwei and Telgarsky, Matus},
  booktitle = {Proceedings of the Thirty-Second Conference on Learning Theory},
  pages     = {1772--1798},
  year      = {2019},
  editor    = {Beygelzimer, Alina and Hsu, Daniel},
  volume    = {99},
  series    = {Proceedings of Machine Learning Research},
  month     = {25--28 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v99/ji19a/ji19a.pdf},
  url       = {https://proceedings.mlr.press/v99/ji19a.html},
  abstract  = {Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.}
}
Endnote
%0 Conference Paper
%T The implicit bias of gradient descent on nonseparable data
%A Ziwei Ji
%A Matus Telgarsky
%B Proceedings of the Thirty-Second Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2019
%E Alina Beygelzimer
%E Daniel Hsu
%F pmlr-v99-ji19a
%I PMLR
%P 1772--1798
%U https://proceedings.mlr.press/v99/ji19a.html
%V 99
%X Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
APA
Ji, Z. & Telgarsky, M. (2019). The implicit bias of gradient descent on nonseparable data. Proceedings of the Thirty-Second Conference on Learning Theory, in Proceedings of Machine Learning Research 99:1772-1798. Available from https://proceedings.mlr.press/v99/ji19a.html.