Learning Neural Networks with Sparse Activations

Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka
Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:406-425, 2024.

Abstract

A core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights that can be deleted from the network, this form of dynamic activation sparsity appears to be harder to exploit to obtain more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of sparsely activated networks will lead to methods that can exploit activation sparsity in practice.
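For intuition only, the following is a minimal sketch (not taken from the paper) of the MLP block the abstract refers to: two fully connected layers with a ReLU activation in between, together with a measurement of the fraction of hidden units that are active (nonzero) on a given input. The dimensions, the ReLU choice, and the random weights are illustrative assumptions; activation sparsity refers to this fraction being very small for a trained network.

```python
import numpy as np

# Hypothetical dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 64, 256, 64

# Randomly initialized weights of the two fully connected layers.
W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden)) / np.sqrt(d_hidden)

def mlp_block(x):
    """Two fully connected layers with a ReLU activation in between."""
    h = np.maximum(W1 @ x + b1, 0.0)   # hidden-layer activations
    return W2 @ h, h

x = rng.standard_normal(d_in)
y, h = mlp_block(x)
# Fraction of hidden units that fire on this input; "activation sparsity"
# means this fraction is extremely small after training.
print("fraction of active hidden units:", np.mean(h != 0))
```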

Cite this Paper


BibTeX
@InProceedings{pmlr-v247-awasthi24a,
  title     = {Learning Neural Networks with Sparse Activations},
  author    = {Awasthi, Pranjal and Dikkala, Nishanth and Kamath, Pritish and Meka, Raghu},
  booktitle = {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages     = {406--425},
  year      = {2024},
  editor    = {Agrawal, Shipra and Roth, Aaron},
  volume    = {247},
  series    = {Proceedings of Machine Learning Research},
  month     = {30 Jun--03 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v247/awasthi24a/awasthi24a.pdf},
  url       = {https://proceedings.mlr.press/v247/awasthi24a.html},
  abstract  = {A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.}
}
Endnote
%0 Conference Paper
%T Learning Neural Networks with Sparse Activations
%A Pranjal Awasthi
%A Nishanth Dikkala
%A Pritish Kamath
%A Raghu Meka
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth
%F pmlr-v247-awasthi24a
%I PMLR
%P 406--425
%U https://proceedings.mlr.press/v247/awasthi24a.html
%V 247
%X A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
APA
Awasthi, P., Dikkala, N., Kamath, P. & Meka, R. (2024). Learning Neural Networks with Sparse Activations. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:406-425. Available from https://proceedings.mlr.press/v247/awasthi24a.html.