Pruning is Optimal for Learning Sparse Features in High-Dimensions

Nuri Mert Vural; Murat A Erdogdu

Proceedings of Thirty Seventh Conference on Learning Theory, PMLR 247:4787-4861, 2024.

Abstract

While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive. In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions. We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$, where $\sigma^*$ is a degree-$p$ polynomial, and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks. Furthermore, we establish Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound. In the same scenario, however, our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.
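For readers who want to experiment with the setting, the following is a minimal sketch of the data model $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$ from the abstract. It is not the authors' code or algorithm. Several choices are assumptions made only for illustration: standard Gaussian inputs, Gaussian noise, hard row-sparsity of $\boldsymbol{V}$ (a simple special case, not the general $\ell_q$ condition), the particular polynomial $\sigma^*(\boldsymbol{z}) = (\sum_j z_j)^p$, and the helper names `sample_multi_index` and `prune_top_k`. The pruning helper is generic magnitude pruning, shown only to make the idea of "keeping a number of weights proportional to the sparsity level" concrete.

```python
import numpy as np


def sample_multi_index(n, d, r, s, p=3, noise_std=0.1, seed=0):
    """Draw n samples from y = sigma*(V^T x) + eps (illustrative assumptions only).

    V is d x r with orthonormal columns supported on s of the d rows (requires s >= r).
    sigma*(z) = (z_1 + ... + z_r)^p is one arbitrary degree-p polynomial.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros((d, r))
    support = rng.choice(d, size=s, replace=False)
    # Reduced QR of an s x r Gaussian matrix gives an s x r block with orthonormal columns.
    Q, _ = np.linalg.qr(rng.standard_normal((s, r)))
    V[support] = Q
    X = rng.standard_normal((n, d))          # assumed isotropic Gaussian inputs
    Z = X @ V                                # projections onto the r relevant directions
    y = Z.sum(axis=1) ** p + noise_std * rng.standard_normal(n)
    return X, y, V, support


def prune_top_k(w, k):
    """Generic magnitude pruning: keep the k largest-magnitude entries of w, zero the rest."""
    w = np.asarray(w, dtype=float)
    keep = np.argpartition(np.abs(w), -k)[-k:]
    pruned = np.zeros_like(w)
    pruned[keep] = w[keep]
    return pruned


if __name__ == "__main__":
    X, y, V, support = sample_multi_index(n=200, d=100, r=2, s=5)
    print(X.shape, y.shape, sorted(support.tolist()))
    print(prune_top_k([0.1, -3.0, 2.0, 0.5], k=2))
```

In this toy setup only `s` of the `d` input coordinates carry signal, which is the regime in which the abstract argues that pruning helps relative to rotationally invariant (basis-independent) training.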

Cite this Paper

BibTeX


@InProceedings{pmlr-v247-vural24a,
  title = 	 {Pruning is Optimal for Learning Sparse Features in High-Dimensions},
  author =       {Vural, Nuri Mert and Erdogdu, Murat A},
  booktitle = 	 {Proceedings of Thirty Seventh Conference on Learning Theory},
  pages = 	 {4787--4861},
  year = 	 {2024},
  editor = 	 {Agrawal, Shipra and Roth, Aaron},
  volume = 	 {247},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {30 Jun--03 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v247/vural24a/vural24a.pdf},
  url = 	 {https://proceedings.mlr.press/v247/vural24a.html},
  abstract = 	 {While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive.  In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions.  We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$,  where $\sigma^*$  is a degree-$p$ polynomial,  and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks.  Furthermore, we establish  Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound.  In the same scenario, however,  our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.}
}

Endnote

%0 Conference Paper
%T Pruning is Optimal for Learning Sparse Features in High-Dimensions
%A Nuri Mert Vural
%A Murat A Erdogdu
%B Proceedings of Thirty Seventh Conference on Learning Theory
%C Proceedings of Machine Learning Research
%D 2024
%E Shipra Agrawal
%E Aaron Roth
%F pmlr-v247-vural24a
%I PMLR
%P 4787--4861
%U https://proceedings.mlr.press/v247/vural24a.html
%V 247
%X While it is commonly observed in practice that pruning networks to a certain level of sparsity can improve the quality of the features, a theoretical explanation of this phenomenon remains elusive.  In this work, we investigate this by demonstrating that a broad class of statistical models can be optimally learned using pruned neural networks trained with gradient descent, in high-dimensions.  We consider learning both single-index and multi-index models of the form $y = \sigma^*(\boldsymbol{V}^{\top} \boldsymbol{x}) + \epsilon$,  where $\sigma^*$  is a degree-$p$ polynomial,  and $\boldsymbol{V} \in \mathbbm{R}^{d \times r}$ with $r \ll d$, is the matrix containing relevant model directions. We assume that $\boldsymbol{V}$ satisfies a certain $\ell_q$-sparsity condition for matrices and show that pruning neural networks proportional to the sparsity level of $\boldsymbol{V}$ improves their sample complexity compared to unpruned networks.  Furthermore, we establish  Correlational Statistical Query (CSQ) lower bounds in this setting, which take the sparsity level of $\boldsymbol{V}$ into account. We show that if the sparsity level of $\boldsymbol{V}$ exceeds a certain threshold, training pruned networks with a gradient descent algorithm achieves the sample complexity suggested by the CSQ lower bound.  In the same scenario, however,  our results imply that basis-independent methods such as models trained via standard gradient descent initialized with rotationally invariant random weights can provably achieve only suboptimal sample complexity.

APA


Vural, N.M. & Erdogdu, M.A. (2024). Pruning is Optimal for Learning Sparse Features in High-Dimensions. Proceedings of Thirty Seventh Conference on Learning Theory, in Proceedings of Machine Learning Research 247:4787-4861. Available from https://proceedings.mlr.press/v247/vural24a.html.
