On the importance of initialization and momentum in deep learning

Ilya Sutskever; James Martens; George Dahl; Geoffrey Hinton

On the importance of initialization and momentum in deep learning

Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton

Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):1139-1147, 2013.

Abstract

Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

Cite this Paper

BibTeX


@InProceedings{pmlr-v28-sutskever13,
  title = 	 {On the importance of initialization and momentum in deep learning},
  author = 	 {Sutskever, Ilya and Martens, James and Dahl, George and Hinton, Geoffrey},
  booktitle = 	 {Proceedings of the 30th International Conference on Machine Learning},
  pages = 	 {1139--1147},
  year = 	 {2013},
  editor = 	 {Dasgupta, Sanjoy and McAllester, David},
  volume = 	 {28},
  number =       {3},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Atlanta, Georgia, USA},
  month = 	 {17--19 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v28/sutskever13.pdf},
  url = 	 {https://proceedings.mlr.press/v28/sutskever13.html},
  abstract = 	 {Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned.     Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.   }
}

Endnote

%0 Conference Paper
%T On the importance of initialization and momentum in deep learning
%A Ilya Sutskever
%A James Martens
%A George Dahl
%A Geoffrey Hinton
%B Proceedings of the 30th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2013
%E Sanjoy Dasgupta
%E David McAllester	
%F pmlr-v28-sutskever13
%I PMLR
%P 1139--1147
%U https://proceedings.mlr.press/v28/sutskever13.html
%V 28
%N 3
%X Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned.     Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

RIS


TY  - CPAPER
TI  - On the importance of initialization and momentum in deep learning
AU  - Ilya Sutskever
AU  - James Martens
AU  - George Dahl
AU  - Geoffrey Hinton
BT  - Proceedings of the 30th International Conference on Machine Learning
DA  - 2013/05/26
ED  - Sanjoy Dasgupta
ED  - David McAllester	
ID  - pmlr-v28-sutskever13
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 28
IS  - 3
SP  - 1139
EP  - 1147
L1  - http://proceedings.mlr.press/v28/sutskever13.pdf
UR  - https://proceedings.mlr.press/v28/sutskever13.html
AB  - Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned.     Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.   
ER  -

APA


Sutskever, I., Martens, J., Dahl, G. & Hinton, G.. (2013). On the importance of initialization and momentum in deep learning. Proceedings of the 30th International Conference on Machine Learning, in Proceedings of Machine Learning Research 28(3):1139-1147 Available from https://proceedings.mlr.press/v28/sutskever13.html.

On the importance of initialization and momentum in deep learning

Abstract

Cite this Paper

Related Material