Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming

Jinuk Kim, Yeonwoo Jeong, Deokjae Lee, Hyun Oh Song
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:16502-16520, 2023.

Abstract

Recent works on neural network pruning advocate that reducing the depth of the network is more effective at reducing run-time memory usage and inference latency than reducing the width of the network through channel pruning. In this regard, some recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a constricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm that targets general convolution operations. We propose a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations for efficient end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our methods and baselines with TensorRT for a fair inference latency comparison. Our method outperforms the baseline method, achieving higher accuracy and faster inference speed on MobileNetV2 on the ImageNet dataset. Specifically, we achieve a $1.41\times$ speed-up with a $0.11$%p accuracy gain on MobileNetV2-1.0 on ImageNet.
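
As background for the merging step the abstract describes, the following is a minimal PyTorch sketch of folding two consecutive convolution layers into one equivalent layer once the activation between them has been replaced by the identity. It shows the standard kernel-composition identity, not the authors' released implementation, and it assumes stride-1, dilation-1, non-grouped convolutions with no zero padding in the second layer (zero-padding the intermediate feature map cannot be reproduced exactly by a single merged convolution at the borders).

import torch
import torch.nn.functional as F
from torch import nn

@torch.no_grad()
def merge_convs(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    # Fold conv2(conv1(x)) into a single Conv2d, valid when no nonlinearity
    # sits between the two layers.  Assumes stride 1, dilation 1, groups 1,
    # and no zero padding in conv2 (see the note above).
    assert conv1.stride == (1, 1) and conv2.stride == (1, 1)
    assert conv1.dilation == (1, 1) and conv2.dilation == (1, 1)
    assert conv1.groups == 1 and conv2.groups == 1
    assert conv2.padding == (0, 0)
    k1, k2 = conv1.kernel_size[0], conv2.kernel_size[0]
    # Composite kernel of size k1 + k2 - 1: correlate conv2's kernel with a
    # spatially flipped, channel-transposed copy of conv1's kernel.
    w = F.conv2d(
        conv2.weight,                                 # (c_out, c_mid, k2, k2), used as the "input"
        conv1.weight.permute(1, 0, 2, 3).flip(2, 3),  # (c_in,  c_mid, k1, k1), used as the "weight"
        padding=k1 - 1,
    )                                                 # -> (c_out, c_in, k1 + k2 - 1, k1 + k2 - 1)
    # Composite bias: push conv1's bias through conv2, then add conv2's bias.
    b = torch.zeros(conv2.out_channels, device=conv2.weight.device)
    if conv1.bias is not None:
        b += conv2.weight.sum(dim=(2, 3)) @ conv1.bias
    if conv2.bias is not None:
        b += conv2.bias
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels,
                       kernel_size=k1 + k2 - 1, padding=conv1.padding, bias=True)
    merged.weight.copy_(w)
    merged.bias.copy_(b)
    return merged

# Quick numerical sanity check on random inputs.
c1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
c2 = nn.Conv2d(8, 16, kernel_size=3)   # no padding, per the assumption above
m = merge_convs(c1, c2)
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(c2(c1(x)), m(x), atol=1e-5))  # True

Note that the merged kernel grows to size k1 + k2 - 1, so merging trades depth for wider kernels, and that depthwise layers (as in MobileNetV2's inverted residual blocks) must first be expanded into dense kernels before this folding applies.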

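The other ingredient is choosing which activation layers to keep. As a purely illustrative sketch of that kind of selection problem (the paper's actual surrogate objective and its two-stage dynamic program differ), the following single dynamic program picks block boundaries to maximize a per-activation importance proxy under a discretized latency budget, given a hypothetical table lat[(i, j)] of measured latencies for each candidate merged block and importance scores imp[j].

import math

def choose_kept_activations(n_layers, lat, imp, budget, step=0.05):
    # lat[(i, j)]: measured latency (e.g., in ms) of layers i..j merged into
    #              one convolution, 1-indexed and inclusive.
    # imp[j]:      importance proxy for keeping the activation after layer j,
    #              for j = 1 .. n_layers - 1.
    # Returns the activation indices to keep so that the total latency of the
    # merged blocks stays within `budget` (discretized to `step`) while the
    # summed importance of kept activations is maximized.  O(n^2 * B) time.
    B = int(round(budget / step))
    NEG = float("-inf")
    # dp[j][b]: best importance over layers 1..j with a block boundary right
    # after layer j and exactly b discretized latency units spent.
    dp = {j: [NEG] * (B + 1) for j in range(n_layers + 1)}
    parent = {}
    dp[0][0] = 0.0
    for j in range(1, n_layers + 1):
        gain = imp[j] if j < n_layers else 0.0   # no activation after the last layer
        for i in range(j):                       # previous boundary; block = layers i+1 .. j
            cost = math.ceil(lat[(i + 1, j)] / step)
            for b in range(cost, B + 1):
                cand = dp[i][b - cost] + gain
                if cand > dp[j][b]:
                    dp[j][b] = cand
                    parent[(j, b)] = (i, b - cost)
    # Pick the best feasible end state and backtrack the kept activations.
    best_b = max(range(B + 1), key=lambda b: dp[n_layers][b])
    if dp[n_layers][best_b] == NEG:
        raise ValueError("latency budget is too small even for a full merge")
    kept, state = [], (n_layers, best_b)
    while state[0] > 0:
        i, b = parent[state]
        if state[0] < n_layers:
            kept.append(state[0])
        state = (i, b)
    return sorted(kept)

# Toy example with 3 layers: keeping both activations exceeds the budget,
# so the DP keeps only the more important one.
lat = {(1, 1): 1.0, (2, 2): 1.0, (3, 3): 1.0,
       (1, 2): 1.4, (2, 3): 1.4, (1, 3): 1.7}
imp = {1: 0.9, 2: 0.3}
print(choose_kept_activations(3, lat, imp, budget=2.5))  # -> [1]
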
Cite this Paper


BibTeX
@InProceedings{pmlr-v202-kim23f,
  title     = {Efficient Latency-Aware {CNN} Depth Compression via Two-Stage Dynamic Programming},
  author    = {Kim, Jinuk and Jeong, Yeonwoo and Lee, Deokjae and Song, Hyun Oh},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {16502--16520},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/kim23f/kim23f.pdf},
  url       = {https://proceedings.mlr.press/v202/kim23f.html}
}
Endnote
%0 Conference Paper
%T Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming
%A Jinuk Kim
%A Yeonwoo Jeong
%A Deokjae Lee
%A Hyun Oh Song
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-kim23f
%I PMLR
%P 16502--16520
%U https://proceedings.mlr.press/v202/kim23f.html
%V 202
APA
Kim, J., Jeong, Y., Lee, D. & Song, H. O. (2023). Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:16502-16520. Available from https://proceedings.mlr.press/v202/kim23f.html.
