Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think

Christian H.X. Ali Mehmeti-Göpel, Jan Disselhoff
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:529-546, 2023.

Abstract

We perform an empirical study of the behaviour of deep networks when some of their feature channels are fully linearized through a sparsity prior on the overall number of nonlinear units in the network. In experiments on image classification and machine translation tasks, we investigate how much we can simplify the network function towards linearity before performance collapses. First, we observe a significant performance gap when reducing nonlinearity in the network function early on as opposed to late in training, in line with recent observations on the time-evolution of the data-dependent NTK. Second, we find that after training we are able to linearize a significant number of nonlinear units while maintaining high performance, indicating that much of a network’s expressivity remains unused but helps gradient descent in the early stages of training. To characterize the depth of the resulting partially linearized network, we introduce a measure called average path length, representing the average number of active nonlinearities encountered along a path in the network graph. Under sparsity pressure, we find that the remaining nonlinear units organize into distinct structures, forming core networks of near-constant effective depth and width, which in turn depend on task difficulty.
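To make the two ingredients of the abstract concrete, here is a minimal, hypothetical sketch in PyTorch of (a) a per-channel gate that interpolates a ReLU channel towards the identity under an L1-style sparsity penalty, and (b) a rough proxy for the average-path-length measure in a plain sequential network. The names PartiallyLinearReLU, nonlinearity_penalty and average_path_length are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class PartiallyLinearReLU(nn.Module):
    """Per-channel gate g in (0, 1): output is g * relu(x) + (1 - g) * x.
    Channels whose gate is driven to 0 become fully linear (identity).
    Illustrative sketch only, not the paper's code."""

    def __init__(self, num_channels: int):
        super().__init__()
        # One learnable gate logit per feature channel.
        self.gate_logits = nn.Parameter(torch.zeros(num_channels))

    def gates(self) -> torch.Tensor:
        return torch.sigmoid(self.gate_logits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the per-channel gate over batch and any spatial dimensions.
        g = self.gates().view(1, -1, *([1] * (x.dim() - 2)))
        return g * torch.relu(x) + (1.0 - g) * x


def nonlinearity_penalty(gated_layers, weight=1e-3):
    """Sparsity pressure: an L1-style penalty on the gates, added to the task
    loss, pushes as many channels as possible towards the linear regime."""
    return weight * sum(layer.gates().sum() for layer in gated_layers)


def average_path_length(gated_layers, threshold=0.5):
    """Rough proxy for the average-path-length measure in a plain sequential
    network: a path crosses one channel per layer, so the expected number of
    active (still nonlinear) units it meets is the sum over layers of the
    fraction of gates above the threshold."""
    return sum((layer.gates() > threshold).float().mean().item()
               for layer in gated_layers)

In such a setup, sweeping the penalty weight upward would linearize more channels, and average_path_length would then track the effective depth of the remaining nonlinear core network; how far this can go before accuracy collapses is the empirical question the paper studies.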

Cite this Paper


BibTeX
@InProceedings{pmlr-v202-ali-mehmeti-gopel23a,
  title     = {Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think},
  author    = {Ali Mehmeti-G\"{o}pel, Christian H.X. and Disselhoff, Jan},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {529--546},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/ali-mehmeti-gopel23a/ali-mehmeti-gopel23a.pdf},
  url       = {https://proceedings.mlr.press/v202/ali-mehmeti-gopel23a.html}
}
Endnote
%0 Conference Paper
%T Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think
%A Christian H.X. Ali Mehmeti-Göpel
%A Jan Disselhoff
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-ali-mehmeti-gopel23a
%I PMLR
%P 529--546
%U https://proceedings.mlr.press/v202/ali-mehmeti-gopel23a.html
%V 202
APA
Ali Mehmeti-Göpel, C.H. & Disselhoff, J. (2023). Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:529-546. Available from https://proceedings.mlr.press/v202/ali-mehmeti-gopel23a.html.