Boosting Deep Neural Network Efficiency with Dual-Module Inference

Liu Liu, Lei Deng, Zhaodong Chen, Yuke Wang, Shuangchen Li, Jingwei Zhang, Yihua Yang, Zhenyu Gu, Yufei Ding, Yuan Xie
Proceedings of the 37th International Conference on Machine Learning, PMLR 119:6205-6215, 2020.

Abstract

Deep neural networks (DNNs) deliver high-quality results across many machine learning tasks, but their memory-bound and compute-bound execution patterns make it difficult to meet stringent latency requirements and energy constraints. We propose big-little dual-module inference, which dynamically skips unnecessary memory accesses and computations to accelerate DNN inference. Leveraging the noise resilience of nonlinear activation functions, we pair the original DNN layer, termed the big module, with a lightweight little module that approximates it. The little module computes activations in the insensitive region, where outputs tolerate approximation error, so the big module's expensive memory accesses and computations are needed only in the sensitive region. For memory-bound models such as recurrent neural networks (RNNs), our method reduces overall memory accesses by 40% on average and achieves 1.54x to 1.75x speedup on a commodity CPU-based server platform with negligible impact on model quality. For compute-bound models such as convolutional neural networks (CNNs), it reduces operations by 3.02x with only a 0.5% accuracy drop.
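To make the mechanism concrete, below is a minimal Python sketch of dual-module inference for a single fully connected layer with a tanh activation. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the threshold theta, the coarsely quantized W_little, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_module_layer(x, W_big, b, W_little, theta=0.75):
    # Little module: cheap approximate pre-activations (e.g., from a
    # low-rank or low-precision copy of W_big; here a coarse stand-in).
    y_approx = W_little @ x + b
    # Sensitive region: |pre-activation| <= theta, near the steep part
    # of tanh, where approximation error would visibly change the output.
    # Outside it (the insensitive region) tanh saturates and the little
    # module's result is kept as-is.
    sensitive = np.abs(y_approx) <= theta
    y = y_approx.copy()
    # Big module: recompute only the sensitive outputs, touching only the
    # corresponding rows of W_big. The skipped rows stand in for the
    # memory accesses and operations the method avoids.
    y[sensitive] = W_big[sensitive] @ x + b[sensitive]
    return np.tanh(y)

# Toy usage with a hypothetical coarsely rounded little module.
d = 512
W_big = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)
W_little = np.round(W_big * 32) / 32
x = rng.standard_normal(d)
out = dual_module_layer(x, W_big, b, W_little)
```

In this sketch the fraction of rows flagged sensitive determines the savings; a real kernel would skip the corresponding weight reads rather than masking them after the fact.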

Cite this Paper


BibTeX
@InProceedings{pmlr-v119-liu20c,
  title     = {Boosting Deep Neural Network Efficiency with Dual-Module Inference},
  author    = {Liu, Liu and Deng, Lei and Chen, Zhaodong and Wang, Yuke and Li, Shuangchen and Zhang, Jingwei and Yang, Yihua and Gu, Zhenyu and Ding, Yufei and Xie, Yuan},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages     = {6205--6215},
  year      = {2020},
  editor    = {III, Hal Daumé and Singh, Aarti},
  volume    = {119},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--18 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v119/liu20c/liu20c.pdf},
  url       = {https://proceedings.mlr.press/v119/liu20c.html}
}
Endnote
%0 Conference Paper
%T Boosting Deep Neural Network Efficiency with Dual-Module Inference
%A Liu Liu
%A Lei Deng
%A Zhaodong Chen
%A Yuke Wang
%A Shuangchen Li
%A Jingwei Zhang
%A Yihua Yang
%A Zhenyu Gu
%A Yufei Ding
%A Yuan Xie
%B Proceedings of the 37th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2020
%E Hal Daumé III
%E Aarti Singh
%F pmlr-v119-liu20c
%I PMLR
%P 6205--6215
%U https://proceedings.mlr.press/v119/liu20c.html
%V 119
APA
Liu, L., Deng, L., Chen, Z., Wang, Y., Li, S., Zhang, J., Yang, Y., Gu, Z., Ding, Y. & Xie, Y. (2020). Boosting Deep Neural Network Efficiency with Dual-Module Inference. Proceedings of the 37th International Conference on Machine Learning, in Proceedings of Machine Learning Research 119:6205-6215. Available from https://proceedings.mlr.press/v119/liu20c.html.