No Free Prune: Information-Theoretic Barriers to Pruning at Initialization

Tanishq Kumar; Kevin Luo; Mark Sellke

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:25662-25681, 2024.

Abstract

The existence of “lottery tickets” (Frankle & Carbin, 2018) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model (“pruning at initialization”) have been broadly unsuccessful (Frankle et al., 2020b). We put forward a theoretical explanation for this, based on the model’s effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show that the Law of Robustness of Bubeck & Sellke (2023) extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist but cannot be found fast (i.e., without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.
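
As a reading aid, the abstract's verbal definition of the effective parameter count can be written out symbolically. The notation here is assumed for illustration and may differ from the paper's own: $w$ denotes the final weights, $m$ the binary sparsity mask, $\mathcal{D}$ the training data, and $I(\cdot\,;\cdot)$ mutual information.

$$
p_\text{eff} = \|m \odot w\|_0 + I(m; \mathcal{D})
$$

Under this reading, a mask chosen independently of the data has $I(m; \mathcal{D}) = 0$, so $p_\text{eff}$ reduces to the count of non-zero weights, whereas a data-dependent mask at the same sparsity contributes a positive information term.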

Cite this Paper

BibTeX

@InProceedings{pmlr-v235-kumar24a,
  title     = {No Free Prune: Information-Theoretic Barriers to Pruning at Initialization},
  author    = {Kumar, Tanishq and Luo, Kevin and Sellke, Mark},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {25662--25681},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/kumar24a/kumar24a.pdf},
  url       = {https://proceedings.mlr.press/v235/kumar24a.html},
  abstract  = {The existence of ``lottery tickets'' (Frankle \& Carbin, 2018) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model (``pruning at initialization'') have been broadly unsuccessful (Frankle et al., 2020b). We put forward a theoretical explanation for this, based on the model's effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show that the Law of Robustness of Bubeck \& Sellke (2023) extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist but cannot be found fast (i.e., without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.}
}

Endnote

%0 Conference Paper
%T No Free Prune: Information-Theoretic Barriers to Pruning at Initialization
%A Tanishq Kumar
%A Kevin Luo
%A Mark Sellke
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-kumar24a
%I PMLR
%P 25662--25681
%U https://proceedings.mlr.press/v235/kumar24a.html
%V 235
%X The existence of “lottery tickets” (Frankle & Carbin, 2018) at or near initialization raises the tantalizing question of whether large models are necessary in deep learning, or whether sparse networks can be quickly identified and trained without ever training the dense models that contain them. However, efforts to find these sparse subnetworks without training the dense model (“pruning at initialization”) have been broadly unsuccessful (Frankle et al., 2020b). We put forward a theoretical explanation for this, based on the model’s effective parameter count, $p_\text{eff}$, given by the sum of the number of non-zero weights in the final network and the mutual information between the sparsity mask and the data. We show that the Law of Robustness of Bubeck & Sellke (2023) extends to sparse networks with the usual parameter count replaced by $p_\text{eff}$, meaning a sparse neural network which robustly interpolates noisy data requires a heavily data-dependent mask. We posit that pruning during and after training outputs masks with higher mutual information than those produced by pruning at initialization. Thus two networks may have the same sparsities, but differ in effective parameter count based on how they were trained. This suggests that pruning near initialization may be infeasible and explains why lottery tickets exist but cannot be found fast (i.e., without training the full network). Experiments on neural networks confirm that information gained during training may indeed affect model capacity.

APA

Kumar, T., Luo, K. & Sellke, M. (2024). No Free Prune: Information-Theoretic Barriers to Pruning at Initialization. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:25662-25681. Available from https://proceedings.mlr.press/v235/kumar24a.html.
