Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function

Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej Nowak
Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, PMLR 89:2221-2230, 2019.

Abstract

We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by analyzing the signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study consequences of this universal behavior for the initial and late phases of the learning process. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning. We propose that in ResNets this can be resolved based on our results by ensuring the same level of dynamical isometry at initialization.
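The abstract's central object, the singular value spectrum of the input-output Jacobian at initialization, can be probed numerically. The sketch below is an illustration, not the authors' code: the width, depth, tanh activation, and the 1/sqrt(L) residual-branch scaling are all assumptions made for the example. It builds a random fully connected ResNet and computes the singular values of its Jacobian, which remain of order one even at depth 50, consistent with the near-isometry the paper describes.

```python
import numpy as np

# Illustrative sketch (not the paper's code): input-output Jacobian of a
# randomly initialized fully connected ResNet. Width n, depth L, and the
# 1/sqrt(L) branch scaling are assumptions chosen for this example.
rng = np.random.default_rng(0)
n, L = 200, 50
c = 1.0 / np.sqrt(L)                 # shrink each residual branch with depth

phi = np.tanh                        # example activation
dphi = lambda x: 1.0 - np.tanh(x) ** 2

x = rng.standard_normal(n)           # random input signal
J = np.eye(n)                        # accumulated input-output Jacobian
for _ in range(L):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # variance-1/n weights
    # Layer map: x -> x + c * W @ phi(x).
    # Its Jacobian is I + c * W @ diag(phi'(x)); W * dphi(x) scales the
    # columns of W, which equals W @ diag(dphi(x)).
    J = (np.eye(n) + c * (W * dphi(x))) @ J
    x = x + c * W @ phi(x)

s = np.linalg.svd(J, compute_uv=False)
print(f"singular values: min={s.min():.3f}, max={s.max():.3f}, "
      f"mean square={np.mean(s**2):.3f}")
```

With this depth scaling, the mean squared singular value stays O(1) as L grows, instead of exploding or collapsing exponentially as in a plain feed-forward stack.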

Cite this Paper


BibTeX
@InProceedings{pmlr-v89-tarnowski19a,
  title     = {Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function},
  author    = {Tarnowski, Wojciech and Warcho{\l}, Piotr and Jastrz{\k{e}}bski, Stanis{\l}aw and Tabor, Jacek and Nowak, Maciej},
  booktitle = {Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics},
  pages     = {2221--2230},
  year      = {2019},
  editor    = {Chaudhuri, Kamalika and Sugiyama, Masashi},
  volume    = {89},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--18 Apr},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v89/tarnowski19a/tarnowski19a.pdf},
  url       = {https://proceedings.mlr.press/v89/tarnowski19a.html},
  abstract  = {We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by analyzing the signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study consequences of this universal behavior for the initial and late phases of the learning process. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning. We propose that in ResNets this can be resolved based on our results by ensuring the same level of dynamical isometry at initialization.}
}
Endnote
%0 Conference Paper
%T Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function
%A Wojciech Tarnowski
%A Piotr Warchoł
%A Stanisław Jastrzębski
%A Jacek Tabor
%A Maciej Nowak
%B Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics
%C Proceedings of Machine Learning Research
%D 2019
%E Kamalika Chaudhuri
%E Masashi Sugiyama
%F pmlr-v89-tarnowski19a
%I PMLR
%P 2221--2230
%U https://proceedings.mlr.press/v89/tarnowski19a.html
%V 89
%X We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit. The resulting singular value spectrum depends on a single parameter, which we calculate for a variety of popular activation functions, by analyzing the signal propagation in the artificial neural network. We corroborate our results with numerical simulations of both random matrices and ResNets applied to the CIFAR-10 classification problem. Moreover, we study consequences of this universal behavior for the initial and late phases of the learning process. We conclude by drawing attention to the simple fact that initialization acts as a confounding factor between the choice of activation function and the rate of learning. We propose that in ResNets this can be resolved based on our results by ensuring the same level of dynamical isometry at initialization.
APA
Tarnowski, W., Warchoł, P., Jastrzębski, S., Tabor, J. & Nowak, M. (2019). Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, in Proceedings of Machine Learning Research 89:2221-2230. Available from https://proceedings.mlr.press/v89/tarnowski19a.html.