Layer-Wise Neural Network Compression via Layer Fusion

James O’Neill, Greg V. Steeg, Aram Galstyan
Proceedings of The 13th Asian Conference on Machine Learning, PMLR 157:1381-1396, 2021.

Abstract

This paper proposes \textit{layer fusion}, a model compression technique that discovers which weights to combine and then fuses the weights of similar fully-connected, convolutional and attention layers. Layer fusion can significantly reduce the number of layers in the original network with little additional computational overhead, while maintaining competitive performance. In experiments on CIFAR-10, we find that various deep convolutional neural networks remain within 2 percentage points of the original networks' accuracy up to a compression ratio of 3.33 when iteratively retrained with layer fusion. In experiments on the WikiText-2 language modelling dataset, we compress Transformer models to 20% of their original size while staying within 5 perplexity points of the original network. We also find that other well-established compression techniques can achieve competitive performance relative to their original networks given a sufficient number of retraining steps. Generally, we observe a clear inflection point in performance as the amount of compression increases, suggesting a bound on the amount of compression that can be achieved before an exponential degradation in performance.
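
The abstract describes layer fusion only at a high level. As an illustrative sketch (not the paper's actual algorithm), the Python snippet below shows one way a similarity-then-average fusion step over same-shape weight tensors could look; the cosine-similarity criterion, the 0.5 averaging, the threshold value, and the names cosine_similarity and fuse_most_similar are all assumptions introduced here for illustration.

# Minimal sketch of a layer-fusion step, assuming a similarity-then-average
# scheme; the paper's actual fusion and alignment criteria may differ.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two flattened weight tensors of equal shape.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fuse_most_similar(layers, threshold=0.5):
    # Greedily merge the most similar pair of same-shape layers by averaging.
    # `layers` is a list of weight arrays; returns a new list in which the two
    # most similar layers (above `threshold`) are replaced by a single average.
    best, best_pair = -1.0, None
    for i in range(len(layers)):
        for j in range(i + 1, len(layers)):
            if layers[i].shape != layers[j].shape:
                continue
            s = cosine_similarity(layers[i], layers[j])
            if s > best:
                best, best_pair = s, (i, j)
    if best_pair is None or best < threshold:
        return layers  # nothing similar enough to fuse
    i, j = best_pair
    fused = 0.5 * (layers[i] + layers[j])
    return [fused if k == i else w for k, w in enumerate(layers) if k != j]

# Toy usage: three fully-connected weight matrices, two of them nearly identical.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
layers = [w, w + 0.01 * rng.standard_normal((4, 4)), rng.standard_normal((4, 4))]
print(len(fuse_most_similar(layers)))  # -> 2 layers after one fusion step

In practice, the fused weights would be shared across the positions of both original layers and the compressed network would then be iteratively retrained, which is the retraining loop the abstract refers to.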

Cite this Paper


BibTeX
@InProceedings{pmlr-v157-o-neill21a,
  title     = {Layer-Wise Neural Network Compression via Layer Fusion},
  author    = {O'Neill, James and V. Steeg, Greg and Galstyan, Aram},
  booktitle = {Proceedings of The 13th Asian Conference on Machine Learning},
  pages     = {1381--1396},
  year      = {2021},
  editor    = {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume    = {157},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--19 Nov},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v157/o-neill21a/o-neill21a.pdf},
  url       = {https://proceedings.mlr.press/v157/o-neill21a.html},
  abstract  = {This paper proposes \textit{layer fusion} - a model compression technique that discovers which weights to combine and then fuses weights of similar fully-connected, convolutional and attention layers. Layer fusion can significantly reduce the number of layers of the original network with little additional computation overhead, while maintaining competitive performance. From experiments on CIFAR-10, we find that various deep convolution neural networks can remain within 2% accuracy points of the original networks up to a compression ratio of 3.33 when iteratively retrained with layer fusion. For experiments on the WikiText-2 language modelling dataset, we compress Transformer models to 20% of their original size while being within 5 perplexity points of the original network. We also find that other well-established compression techniques can achieve competitive performance when compared to their original networks given a sufficient number of retraining steps. Generally, we observe a clear inflection point in performance as the amount of compression increases, suggesting a bound on the amount of compression that can be achieved before an exponential degradation in performance.}
}
Endnote
%0 Conference Paper
%T Layer-Wise Neural Network Compression via Layer Fusion
%A James O’Neill
%A Greg V. Steeg
%A Aram Galstyan
%B Proceedings of The 13th Asian Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Vineeth N. Balasubramanian
%E Ivor Tsang
%F pmlr-v157-o-neill21a
%I PMLR
%P 1381--1396
%U https://proceedings.mlr.press/v157/o-neill21a.html
%V 157
%X This paper proposes \textit{layer fusion} - a model compression technique that discovers which weights to combine and then fuses weights of similar fully-connected, convolutional and attention layers. Layer fusion can significantly reduce the number of layers of the original network with little additional computation overhead, while maintaining competitive performance. From experiments on CIFAR-10, we find that various deep convolution neural networks can remain within 2% accuracy points of the original networks up to a compression ratio of 3.33 when iteratively retrained with layer fusion. For experiments on the WikiText-2 language modelling dataset, we compress Transformer models to 20% of their original size while being within 5 perplexity points of the original network. We also find that other well-established compression techniques can achieve competitive performance when compared to their original networks given a sufficient number of retraining steps. Generally, we observe a clear inflection point in performance as the amount of compression increases, suggesting a bound on the amount of compression that can be achieved before an exponential degradation in performance.
APA
O’Neill, J., V. Steeg, G., & Galstyan, A. (2021). Layer-Wise Neural Network Compression via Layer Fusion. Proceedings of The 13th Asian Conference on Machine Learning, in Proceedings of Machine Learning Research 157:1381-1396. Available from https://proceedings.mlr.press/v157/o-neill21a.html.

Related Material

Download PDF: https://proceedings.mlr.press/v157/o-neill21a/o-neill21a.pdf