Revisiting Weight Initialization of Deep Neural Networks

Maciej Skorski, Alessandro Temperoni, Martin Theobald
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1192-1207, 2021.

Abstract

The proper initialization of weights is crucial for the effective training and fast convergence of deep neural networks (DNNs). Prior work in this area has mostly focused on the principle of balancing the variance among weights per layer to maintain stability of (i) the input data propagated forwards through the network and (ii) the loss gradients propagated backwards, respectively. This prevalent heuristic is, however, agnostic of dependencies among gradients across the various layers and captures only first-order effects per layer. In this paper, we investigate a unifying approach, based on approximating and controlling the norm of the layers’ Hessians, which both generalizes and explains existing initialization schemes such as smooth activation functions, Dropouts, and ReLU.

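For context, the sketch below illustrates the variance-balancing heuristic the abstract refers to, in the style of Glorot/Xavier and He initialization, using plain NumPy. This is background on the baselines the paper revisits, not an implementation of the paper's Hessian-based scheme; the function name and layer sizes are illustrative assumptions.

import numpy as np

def variance_scaled_init(fan_in, fan_out, mode="glorot", rng=None):
    """Draw a weight matrix whose per-layer variance is balanced.

    mode="glorot": Var(W) = 2 / (fan_in + fan_out)   (Glorot & Bengio, 2010)
    mode="he":     Var(W) = 2 / fan_in               (He et al., 2015; suited to ReLU)
    """
    rng = np.random.default_rng() if rng is None else rng
    if mode == "glorot":
        std = np.sqrt(2.0 / (fan_in + fan_out))
    elif mode == "he":
        std = np.sqrt(2.0 / fan_in)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return rng.normal(0.0, std, size=(fan_out, fan_in))

# Example: initialize a 3-layer MLP with He scaling (illustrative layer sizes).
layer_sizes = [784, 256, 128, 10]
weights = [variance_scaled_init(n_in, n_out, mode="he")
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
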
Cite this Paper


BibTeX
@InProceedings{pmlr-v157-skorski21a,
  title     = {Revisiting Weight Initialization of Deep Neural Networks},
  author    = {Skorski, Maciej and Temperoni, Alessandro and Theobald, Martin},
  booktitle = {Proceedings of The 13th Asian Conference on Machine Learning},
  pages     = {1192--1207},
  year      = {2021},
  editor    = {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume    = {157},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v157/skorski21a/skorski21a.pdf},
  url       = {https://proceedings.mlr.press/v157/skorski21a.html},
  abstract  = {The proper {\em initialization of weights} is crucial for the effective training and fast convergence of {\em deep neural networks} (DNNs). Prior work in this area has mostly focused on the principle of {\em balancing the variance among weights per layer} to maintain stability of (i) the input data propagated forwards through the network, and (ii) the loss gradients propagated backwards, respectively. This prevalent heuristic is however agnostic of dependencies among gradients across the various layers and captures only first-order effects per layer. In this paper, we investigate a {\em unifying approach}, based on approximating and controlling the {\em norm of the layers’ Hessians}, which both generalizes and explains existing initialization schemes such as {\em smooth activation functions}, {\em Dropouts}, and {\em ReLU}.}
}
Endnote
%0 Conference Paper
%T Revisiting Weight Initialization of Deep Neural Networks
%A Maciej Skorski
%A Alessandro Temperoni
%A Martin Theobald
%B Proceedings of The 13th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Vineeth N. Balasubramanian
%E Ivor Tsang
%F pmlr-v157-skorski21a
%I PMLR
%P 1192--1207
%U https://proceedings.mlr.press/v157/skorski21a.html
%V 157
%X The proper {\em initialization of weights} is crucial for the effective training and fast convergence of {\em deep neural networks} (DNNs). Prior work in this area has mostly focused on the principle of {\em balancing the variance among weights per layer} to maintain stability of (i) the input data propagated forwards through the network, and (ii) the loss gradients propagated backwards, respectively. This prevalent heuristic is however agnostic of dependencies among gradients across the various layers and captures only first-order effects per layer. In this paper, we investigate a {\em unifying approach}, based on approximating and controlling the {\em norm of the layers’ Hessians}, which both generalizes and explains existing initialization schemes such as {\em smooth activation functions}, {\em Dropouts}, and {\em ReLU}.
APA
Skorski, M., Temperoni, A. & Theobald, M. (2021). Revisiting Weight Initialization of Deep Neural Networks. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:1192-1207. Available from https://proceedings.mlr.press/v157/skorski21a.html.