Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size

Jack Kosaian, Amar Phanishayee, Matthai Philipose, Debadeepta Dey, Rashmi Vinayak
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:5731-5741, 2021.

Abstract

Datacenter vision systems widely use small, specialized convolutional neural networks (CNNs) trained on specific tasks for high-throughput inference. These settings employ accelerators with massive computational capacity, but which specialized CNNs underutilize due to having low arithmetic intensity. This results in suboptimal application-level throughput and poor returns on accelerator investment. Increasing batch size is the only known way to increase both application-level throughput and accelerator utilization for inference, but yields diminishing returns; specialized CNNs poorly utilize accelerators even with large batch size. We propose FoldedCNNs, a new approach to CNN design that increases inference throughput and utilization beyond large batch size. FoldedCNNs rethink the structure of inputs and layers of specialized CNNs to boost arithmetic intensity: in FoldedCNNs, f images with C channels each are concatenated into a single input with fC channels and jointly classified by a wider CNN. Increased arithmetic intensity in FoldedCNNs increases the throughput and GPU utilization of specialized CNN inference by up to 2.5x and 2.8x, with accuracy close to the original CNN in most cases.
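The folding transform described in the abstract can be sketched in a few lines of numpy: groups of f images are concatenated along the channel dimension to form inputs with fC channels. This is an illustrative sketch of the input reshaping only (the wider CNN that consumes these folded inputs is not shown); the function name `fold_batch` and the NHWC layout are assumptions, not taken from the paper.

```python
import numpy as np

def fold_batch(images, f):
    """Fold a batch of N images of shape (N, H, W, C) into N // f
    inputs of shape (N // f, H, W, f * C) by concatenating each
    group of f consecutive images along the channel dimension."""
    n, h, w, c = images.shape
    assert n % f == 0, "batch size must be divisible by the fold factor f"
    # (N, H, W, C) -> (N//f, f, H, W, C) -> (N//f, H, W, f, C) -> (N//f, H, W, f*C)
    return (images.reshape(n // f, f, h, w, c)
                  .transpose(0, 2, 3, 1, 4)
                  .reshape(n // f, h, w, f * c))

# Example: 8 RGB images (C = 3) folded with f = 4 yield 2 inputs
# with 12 channels each, raising per-input arithmetic intensity.
batch = np.random.rand(8, 32, 32, 3).astype(np.float32)
folded = fold_batch(batch, f=4)
print(folded.shape)  # (2, 32, 32, 12)
```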

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-kosaian21a,
  title     = {Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size},
  author    = {Kosaian, Jack and Phanishayee, Amar and Philipose, Matthai and Dey, Debadeepta and Vinayak, Rashmi},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {5731--5741},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/kosaian21a/kosaian21a.pdf},
  url       = {https://proceedings.mlr.press/v139/kosaian21a.html},
  abstract  = {Datacenter vision systems widely use small, specialized convolutional neural networks (CNNs) trained on specific tasks for high-throughput inference. These settings employ accelerators with massive computational capacity, but which specialized CNNs underutilize due to having low arithmetic intensity. This results in suboptimal application-level throughput and poor returns on accelerator investment. Increasing batch size is the only known way to increase both application-level throughput and accelerator utilization for inference, but yields diminishing returns; specialized CNNs poorly utilize accelerators even with large batch size. We propose FoldedCNNs, a new approach to CNN design that increases inference throughput and utilization beyond large batch size. FoldedCNNs rethink the structure of inputs and layers of specialized CNNs to boost arithmetic intensity: in FoldedCNNs, f images with C channels each are concatenated into a single input with fC channels and jointly classified by a wider CNN. Increased arithmetic intensity in FoldedCNNs increases the throughput and GPU utilization of specialized CNN inference by up to 2.5x and 2.8x, with accuracy close to the original CNN in most cases.}
}
Endnote
%0 Conference Paper
%T Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size
%A Jack Kosaian
%A Amar Phanishayee
%A Matthai Philipose
%A Debadeepta Dey
%A Rashmi Vinayak
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-kosaian21a
%I PMLR
%P 5731--5741
%U https://proceedings.mlr.press/v139/kosaian21a.html
%V 139
%X Datacenter vision systems widely use small, specialized convolutional neural networks (CNNs) trained on specific tasks for high-throughput inference. These settings employ accelerators with massive computational capacity, but which specialized CNNs underutilize due to having low arithmetic intensity. This results in suboptimal application-level throughput and poor returns on accelerator investment. Increasing batch size is the only known way to increase both application-level throughput and accelerator utilization for inference, but yields diminishing returns; specialized CNNs poorly utilize accelerators even with large batch size. We propose FoldedCNNs, a new approach to CNN design that increases inference throughput and utilization beyond large batch size. FoldedCNNs rethink the structure of inputs and layers of specialized CNNs to boost arithmetic intensity: in FoldedCNNs, f images with C channels each are concatenated into a single input with fC channels and jointly classified by a wider CNN. Increased arithmetic intensity in FoldedCNNs increases the throughput and GPU utilization of specialized CNN inference by up to 2.5x and 2.8x, with accuracy close to the original CNN in most cases.
APA
Kosaian, J., Phanishayee, A., Philipose, M., Dey, D. & Vinayak, R. (2021). Boosting the Throughput and Accelerator Utilization of Specialized CNN Inference Beyond Increasing Batch Size. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:5731-5741. Available from https://proceedings.mlr.press/v139/kosaian21a.html.