Layer-Wise Neural Network Compression via Layer Fusion
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1381-1396, 2021.
Abstract
This paper proposes layer fusion, a model compression technique that discovers which weights to combine and then fuses the weights of similar fully-connected, convolutional and attention layers. Layer fusion can significantly reduce the number of layers in the original network with little additional computational overhead, while maintaining competitive performance. In experiments on CIFAR-10, we find that various deep convolutional neural networks can remain within 2 percentage points of the original networks' accuracy up to a compression ratio of 3.33 when iteratively retrained with layer fusion. In experiments on the WikiText-2 language modelling dataset, we compress Transformer models to 20% of their original size while remaining within 5 perplexity points of the original network. We also find that other well-established compression techniques can achieve competitive performance relative to their original networks given a sufficient number of retraining steps. Generally, we observe a clear inflection point in performance as the amount of compression increases, suggesting a bound on the amount of compression that can be achieved before an exponential degradation in performance.
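To make the idea of fusing similar layers concrete, the sketch below is a minimal, hypothetical illustration and not the authors' algorithm: it only handles same-shaped fully-connected layers, uses cosine similarity of flattened weights as the similarity measure, fuses by element-wise averaging, and shares the resulting layer. The function names (fuse_similar_layers, cosine_similarity) and the threshold value are assumptions introduced here for illustration; the paper's method also covers convolutional and attention layers and includes iterative retraining after fusion.

```python
# Illustrative sketch of layer fusion for same-shaped linear layers.
# Assumptions (not from the paper): cosine similarity of flattened weights,
# element-wise averaging as the fusion rule, and a fixed threshold of 0.9.
import torch
import torch.nn as nn
import torch.nn.functional as F


def cosine_similarity(w1: torch.Tensor, w2: torch.Tensor) -> float:
    """Cosine similarity between two flattened weight tensors."""
    return F.cosine_similarity(w1.flatten(), w2.flatten(), dim=0).item()


def fuse_similar_layers(layers: nn.ModuleList, threshold: float = 0.9) -> nn.ModuleList:
    """Greedily replace pairs of sufficiently similar, same-shaped linear
    layers with a single shared layer whose weights are their mean."""
    fused = list(layers)
    for i in range(len(fused)):
        for j in range(i + 1, len(fused)):
            a, b = fused[i], fused[j]
            if a is b or a.weight.shape != b.weight.shape:
                continue
            if cosine_similarity(a.weight, b.weight) >= threshold:
                with torch.no_grad():
                    a.weight.copy_((a.weight + b.weight) / 2)
                    if a.bias is not None and b.bias is not None:
                        a.bias.copy_((a.bias + b.bias) / 2)
                fused[j] = a  # share the fused layer so its parameters are stored once
    return nn.ModuleList(fused)


# Usage: fuse a stack of equally sized hidden layers, then retrain the result.
layers = nn.ModuleList(nn.Linear(128, 128) for _ in range(6))
fused = fuse_similar_layers(layers, threshold=0.9)
# nn.Module.parameters() de-duplicates shared parameters, so this reports
# the compressed parameter count.
print(sum(p.numel() for p in fused.parameters()))
```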