Convergence of Gradient Descent on Separable Data

Mor Shpigel Nacson; Jason Lee; Suriya Gunasekar; Pedro Henrique Pamplona Savarese; Nathan Srebro; Daniel Soudry

Convergence of Gradient Descent on Separable Data

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, Daniel Soudry

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:3420-3428, 2019.

Abstract

We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rates with fixed step size is indeed obtained for the commonly used exponentially tailed losses such as logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow as $1/\log(t)$, as also proved in Soudry et al (2018). For linear models with exponential loss, we further prove that the convergence rate could be improved to $\log (t) /\sqrt{t}$ by using aggressive step sizes that compensates for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.

Cite this Paper

BibTeX

@InProceedings{pmlr-v89-nacson19b,
  title = 	 {Convergence of Gradient Descent on Separable Data},
  author =       {Nacson, Mor Shpigel and Lee, Jason and Gunasekar, Suriya and Savarese, Pedro Henrique Pamplona and Srebro, Nathan and Soudry, Daniel},
  booktitle = 	 {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages = 	 {3420--3428},
  year = 	 {2019},
  editor = 	 {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume = 	 {89},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {16--18 Apr},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v89/nacson19b/nacson19b.pdf},
  url = 	 {https://proceedings.mlr.press/v89/nacson19b.html},
  abstract = 	 {We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We  show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rates with fixed step size is indeed obtained for the commonly used exponentially tailed losses such as logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow as $1/\log(t)$, as also proved in Soudry et al (2018). For linear models with exponential loss, we further prove that the convergence rate could be improved to $\log (t) /\sqrt{t}$  by using aggressive step sizes that compensates for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.}
}

Endnote

%0 Conference Paper
%T Convergence of Gradient Descent on Separable Data
%A Mor Shpigel Nacson
%A Jason Lee
%A Suriya Gunasekar
%A Pedro Henrique Pamplona Savarese
%A Nathan Srebro
%A Daniel Soudry
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama	
%F pmlr-v89-nacson19b
%I PMLR
%P 3420--3428
%U https://proceedings.mlr.press/v89/nacson19b.html
%V 89
%X We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margin convergence depend on the tail of the loss function and the choice of the step size? We  show that for a large family of super-polynomial tailed losses, gradient descent iterates on linear networks of any depth converge in the direction of $L_2$ maximum-margin solution, while this does not hold for losses with heavier tails. Within this family, for simple linear models we show that the optimal rates with fixed step size is indeed obtained for the commonly used exponentially tailed losses such as logistic loss. However, with a fixed step size the optimal convergence rate is extremely slow as $1/\log(t)$, as also proved in Soudry et al (2018). For linear models with exponential loss, we further prove that the convergence rate could be improved to $\log (t) /\sqrt{t}$  by using aggressive step sizes that compensates for the rapidly vanishing gradients. Numerical results suggest this method might be useful for deep networks.

APA

Nacson, M.S., Lee, J., Gunasekar, S., Savarese, P.H.P., Srebro, N. & Soudry, D.. (2019). Convergence of Gradient Descent on Separable Data. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:3420-3428 Available from https://proceedings.mlr.press/v89/nacson19b.html.

Convergence of Gradient Descent on Separable Data

Abstract

Cite this Paper

Related Material