On random kernels of residual architectures

Etai Littwin, Tomer Galanti, Lior Wolf
Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, PMLR 161:897-907, 2021.

Abstract

We analyze the finite corrections to the neural tangent kernel (NTK) of residual and densely connected networks, as a function of both depth and width. Surprisingly, our analysis reveals that given a fixed depth, residual networks provide the best tradeoff between parameter complexity and the coefficient of variation (normalized variance), followed by densely connected networks and vanilla MLPs. In networks that do not use skip connections, convergence to the NTK requires one to fix the depth while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided a proper initialization. In DenseNets, however, the convergence of the NTK to its limit as the width tends to infinity is guaranteed at a rate that is independent of both the depth and the scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.
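
For readers who want to reproduce the objects mentioned in the abstract, the following is a minimal sketch (not the authors' code) of the empirical, finite-width NTK of a toy residual MLP at random initialization, kernel ridge regression on its random gradient features, and a Monte Carlo estimate of the normalized variance of a kernel entry over initializations. The width, depth, residual scale alpha, ridge term, and all function names are illustrative assumptions.

# Minimal sketch in JAX; all hyperparameters are assumptions for illustration.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init_resnet(key, depth, width):
    # One width x width weight matrix per residual block, scaled by 1/sqrt(width).
    keys = jax.random.split(key, depth)
    return [jax.random.normal(k, (width, width)) / jnp.sqrt(width) for k in keys]

def resnet(Ws, x, alpha=0.1):
    # Residual blocks with identity skip connections: h <- h + alpha * W relu(h).
    h = x
    for W in Ws:
        h = h + alpha * W @ jax.nn.relu(h)
    return jnp.sum(h) / jnp.sqrt(h.shape[0])  # scalar readout

def empirical_ntk(Ws, X):
    # K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>, gradients w.r.t. the weights Ws.
    def flat_grad(x):
        flat, _ = ravel_pytree(jax.grad(resnet)(Ws, x))
        return flat
    G = jnp.stack([flat_grad(x) for x in X])  # (n, num_params)
    return G @ G.T

key_w, key_x = jax.random.split(jax.random.PRNGKey(0))
width, depth, n = 64, 16, 8
Ws = init_resnet(key_w, depth, width)
X = jax.random.normal(key_x, (n, width))
y = jnp.sin(X[:, 0])  # toy regression targets

# Kernel ridge regression with random gradient features at initialization.
K = empirical_ntk(Ws, X)
coeffs = jnp.linalg.solve(K + 1e-3 * jnp.eye(n), y)

# Normalized variance Var[K]/E[K]^2 of one kernel entry over random initializations,
# a rough empirical proxy for the coefficient-of-variation quantity in the abstract.
entries = jnp.stack([
    empirical_ntk(init_resnet(jax.random.PRNGKey(s), depth, width), X[:2])[0, 1]
    for s in range(20)
])
normalized_var = jnp.var(entries) / jnp.mean(entries) ** 2
print(K.shape, float(normalized_var))

Swapping the residual update for a plain feedforward layer (h <- W relu(h)) or a densely connected one lets the same script compare how the normalized variance behaves across the three architectures as depth and width grow.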

Cite this Paper


BibTeX
@InProceedings{pmlr-v161-littwin21a,
  title     = {On random kernels of residual architectures},
  author    = {Littwin, Etai and Galanti, Tomer and Wolf, Lior},
  booktitle = {Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence},
  pages     = {897--907},
  year      = {2021},
  editor    = {de Campos, Cassio and Maathuis, Marloes H.},
  volume    = {161},
  series    = {Proceedings of Machine Learning Research},
  month     = {27--30 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v161/littwin21a/littwin21a.pdf},
  url       = {https://proceedings.mlr.press/v161/littwin21a.html},
  abstract  = {We analyze the finite corrections to the neural tangent kernel (NTK) of residual and densely connected networks, as a function of both depth and width. Surprisingly, our analysis reveals that given a fixed depth, residual networks provide the best tradeoff between parameter complexity and the coefficient of variation (normalized variance), followed by densely connected networks and vanilla MLPs. In networks that do not use skip connections, convergence to the NTK requires one to fix the depth while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided a proper initialization. In DenseNets, however, the convergence of the NTK to its limit as the width tends to infinity is guaranteed at a rate that is independent of both the depth and the scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.}
}
Endnote
%0 Conference Paper
%T On random kernels of residual architectures
%A Etai Littwin
%A Tomer Galanti
%A Lior Wolf
%B Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence
%C Proceedings of Machine Learning Research
%D 2021
%E Cassio de Campos
%E Marloes H. Maathuis
%F pmlr-v161-littwin21a
%I PMLR
%P 897--907
%U https://proceedings.mlr.press/v161/littwin21a.html
%V 161
%X We analyze the finite corrections to the neural tangent kernel (NTK) of residual and densely connected networks, as a function of both depth and width. Surprisingly, our analysis reveals that given a fixed depth, residual networks provide the best tradeoff between parameter complexity and the coefficient of variation (normalized variance), followed by densely connected networks and vanilla MLPs. In networks that do not use skip connections, convergence to the NTK requires one to fix the depth while increasing the layers' width. Our findings show that in ResNets, convergence to the NTK may occur when depth and width simultaneously tend to infinity, provided a proper initialization. In DenseNets, however, the convergence of the NTK to its limit as the width tends to infinity is guaranteed at a rate that is independent of both the depth and the scale of the weights. Our experiments validate the theoretical results and demonstrate the advantage of deep ResNets and DenseNets for kernel regression with random gradient features.
APA
Littwin, E., Galanti, T. & Wolf, L. (2021). On random kernels of residual architectures. Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, in Proceedings of Machine Learning Research 161:897-907. Available from https://proceedings.mlr.press/v161/littwin21a.html.