Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?

Mariia Seleznova, Gitta Kutyniok
Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:868-895, 2022.

Abstract

Neural Tangent Kernel (NTK) theory is widely used to study the dynamics of infinitely-wide deep neural networks (DNNs) under gradient descent. But do the results for infinitely-wide networks give us hints about the behavior of real finite-width ones? In this paper, we study empirically when NTK theory is valid in practice for fully-connected ReLU and sigmoid DNNs. We find that whether a network is in the NTK regime depends on the hyperparameters of random initialization and the network's depth. In particular, NTK theory does not explain the behavior of sufficiently deep networks initialized so that their gradients explode as they propagate through the network's layers: in this case the kernel is random at initialization and changes significantly during training, contrary to NTK theory. On the other hand, in the case of vanishing gradients, DNNs are in the NTK regime but rapidly become untrainable with depth. We also describe a framework for studying generalization properties of DNNs, in particular the variance of the network's output function, by means of NTK theory, and discuss its limits.
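To make the central object concrete: the empirical NTK of a network f_θ on inputs x_i, x_j is the Gram matrix Θ(x_i, x_j) = ⟨∇_θ f_θ(x_i), ∇_θ f_θ(x_j)⟩, and NTK theory predicts that for sufficiently wide networks this matrix is essentially deterministic at initialization and nearly constant during training. The sketch below computes it in JAX for a small fully-connected ReLU network at initialization. This is a minimal illustration, not the authors' code: the NTK parameterization and the hyperparameter names sigma_w, sigma_b are assumptions made for the example.

import jax
import jax.numpy as jnp

def init_params(key, widths):
    """Standard-normal weights and biases for a fully-connected network.
    The initialization scale is applied in the forward pass (NTK parameterization)."""
    params = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        key, w_key, b_key = jax.random.split(key, 3)
        params.append((jax.random.normal(w_key, (fan_out, fan_in)),
                       jax.random.normal(b_key, (fan_out,))))
    return params

def forward(params, x, sigma_w=jnp.sqrt(2.0), sigma_b=0.1):
    """ReLU network with scalar output; sigma_w, sigma_b are the initialization
    hyperparameters (illustrative names)."""
    h = x
    for i, (W, b) in enumerate(params):
        z = (sigma_w / jnp.sqrt(W.shape[1])) * (W @ h) + sigma_b * b
        h = jax.nn.relu(z) if i < len(params) - 1 else z
    return h[0]  # scalar output

def empirical_ntk(params, X):
    """Theta[i, j] = <grad_theta f(x_i), grad_theta f(x_j)> over all parameters."""
    per_example_grads = jax.vmap(lambda x: jax.grad(forward)(params, x))(X)
    flat = jnp.concatenate(
        [g.reshape(X.shape[0], -1) for g in jax.tree_util.tree_leaves(per_example_grads)],
        axis=1)
    return flat @ flat.T

# Example: 8x8 empirical NTK of a depth-3 ReLU network at initialization.
key = jax.random.PRNGKey(0)
params = init_params(key, widths=[2, 256, 256, 1])
X = jax.random.normal(jax.random.PRNGKey(1), (8, 2))
Theta = empirical_ntk(params, X)

Setting sigma_w noticeably above (below) the ReLU critical value sqrt(2) pushes a deep network toward the exploding- (vanishing-) gradient regime the abstract refers to; comparing Theta at initialization and after a few training steps gives a rough indication of how far such a network is from the NTK regime.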

Cite this Paper


BibTeX
@InProceedings{pmlr-v145-seleznova22a,
  title     = {Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?},
  author    = {Seleznova, Mariia and Kutyniok, Gitta},
  booktitle = {Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference},
  pages     = {868--895},
  year      = {2022},
  editor    = {Bruna, Joan and Hesthaven, Jan and Zdeborova, Lenka},
  volume    = {145},
  series    = {Proceedings of Machine Learning Research},
  month     = {16--19 Aug},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v145/seleznova22a/seleznova22a.pdf},
  url       = {https://proceedings.mlr.press/v145/seleznova22a.html}
}
Endnote
%0 Conference Paper
%T Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?
%A Mariia Seleznova
%A Gitta Kutyniok
%B Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference
%C Proceedings of Machine Learning Research
%D 2022
%E Joan Bruna
%E Jan Hesthaven
%E Lenka Zdeborova
%F pmlr-v145-seleznova22a
%I PMLR
%P 868--895
%U https://proceedings.mlr.press/v145/seleznova22a.html
%V 145
APA
Seleznova, M. & Kutyniok, G. (2022). Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory? Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, in Proceedings of Machine Learning Research 145:868-895. Available from https://proceedings.mlr.press/v145/seleznova22a.html.
