Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes

Ohad Shamir; Tong Zhang

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):71-79, 2013.

Abstract

Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after T rounds, the suboptimality of the \emph{last} SGD iterate scales as O(\log(T)/\sqrt{T}) for non-smooth convex objective functions, and O(\log(T)/T) in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.
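
As a rough illustration of what computing an average "on-the-fly" can look like, the sketch below runs SGD on a toy non-smooth objective and keeps a weighted running average of the iterates using constant memory. The abstract does not state the averaging formula, so the update rule (a polynomial-decay style weighting), the parameter `eta`, the step size `step_scale / sqrt(t)`, and the toy objective f(w) = |w| are all assumptions made for this example, not details confirmed by the text above.

```python
import random


def sgd_with_running_average(subgradient, w0, num_rounds, step_scale=1.0, eta=3.0):
    """Run SGD and maintain a weighted running average of the iterates.

    Assumed averaging rule (not taken from the abstract):
        avg_t = (1 - rho_t) * avg_{t-1} + rho_t * w_t,  rho_t = (eta + 1) / (t + eta)
    With eta = 0 this reduces to the plain uniform average of w_1, ..., w_t.
    """
    w = w0
    avg = w0
    for t in range(1, num_rounds + 1):
        # Assumed step size: step_scale / sqrt(t).
        w = w - (step_scale / t ** 0.5) * subgradient(w)
        # Update the average in place; no history of iterates is stored.
        rho = (eta + 1.0) / (t + eta)
        avg = (1.0 - rho) * avg + rho * w
    return w, avg


def noisy_abs_subgradient(w, rng):
    """Noisy subgradient of the toy non-smooth function f(w) = |w|."""
    sign = 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)
    return sign + rng.gauss(0.0, 1.0)


rng = random.Random(0)
last, averaged = sgd_with_running_average(
    lambda w: noisy_abs_subgradient(w, rng), w0=5.0, num_rounds=10000
)
print("last iterate:", last)
print("running average:", averaged)
```

Because the average is updated from the previous average and the current iterate only, it needs no advance knowledge of the total number of rounds, unlike an average over a fixed suffix of the iterates.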

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-shamir13,
  title = 	 {Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes},
  author = 	 {Shamir, Ohad and Zhang, Tong},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {71--79},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {1},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/shamir13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/shamir13.html},
  abstract = 	 {Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after T rounds, the suboptimality of the \emph{last} SGD iterate scales as O(\log(T)/\sqrt{T}) for non-smooth convex objective functions, and O(\log(T)/T) in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.}
}

Endnote

%0 Conference Paper
%T Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes
%A Ohad Shamir
%A Tong Zhang
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester
%F pmlr-v28-shamir13
%I PMLR
%P 71--79
%U https://proceedings.mlr.press/v28/shamir13.html
%V 28
%N 1
%X Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after T rounds, the suboptimality of the \emph{last} SGD iterate scales as O(\log(T)/\sqrt{T}) for non-smooth convex objective functions, and O(\log(T)/T) in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.

RIS


TY  - CPAPER
TI  - Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes
AU  - Ohad Shamir
AU  - Tong Zhang
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/02/13
ED  - Sanjoy Dasgupta
ED  - David McAllester
ID  - pmlr-v28-shamir13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 1
SP  - 71
EP  - 79
L1  - http://proceedings.mlr.press/v28/shamir13.pdf
UR  - https://proceedings.mlr.press/v28/shamir13.html
AB  - Stochastic Gradient Descent (SGD) is one of the simplest and most popular stochastic optimization methods. While it has already been theoretically studied for decades, the classical analysis usually required non-trivial smoothness assumptions, which do not apply to many modern applications of SGD with non-smooth objective functions such as support vector machines. In this paper, we investigate the performance of SGD \emph{without} such smoothness assumptions, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy. In this framework, we prove that after T rounds, the suboptimality of the \emph{last} SGD iterate scales as O(\log(T)/\sqrt{T}) for non-smooth convex objective functions, and O(\log(T)/T) in the non-smooth strongly convex case. To the best of our knowledge, these are the first bounds of this kind, and almost match the minimax-optimal rates obtainable by appropriate averaging schemes. We also propose a new and simple averaging scheme, which not only attains optimal rates, but can also be easily computed on-the-fly (in contrast, the suffix averaging scheme proposed in \citet{RakhShaSri12arxiv} is not as simple to implement). Finally, we provide some experimental illustrations.
ER  -

APA


Shamir, O. & Zhang, T. (2013). Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(1):71-79. Available from https://proceedings.mlr.press/v28/shamir13.html.
