Deep Tensor Convolution on Multicores

David Budden, Alexander Matveev, Shibani Santurkar, Shraman Ray Chaudhuri, Nir Shavit
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:615-624, 2017.

Abstract

Deep convolutional neural networks (ConvNets) of 3-dimensional kernels allow joint modeling of spatiotemporal features. These networks have improved performance of video and volumetric image analysis, but have been limited in size due to the low memory ceiling of GPU hardware. Existing CPU implementations overcome this constraint but are impractically slow. Here we extend and optimize the faster Winograd-class of convolutional algorithms to the $N$-dimensional case and specifically for CPU hardware. First, we remove the need to manually hand-craft algorithms by exploiting the relaxed constraints and cheap sparse access of CPU memory. Second, we maximize CPU utilization and multicore scalability by transforming data matrices to be cache-aware, integer multiples of AVX vector widths. Treating 2-dimensional ConvNets as a special (and the least beneficial) case of our approach, we demonstrate a 5 to 25-fold improvement in throughput compared to previous state-of-the-art.
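For readers unfamiliar with the Winograd-class algorithms the abstract refers to: they trade extra additions for fewer multiplications when computing small convolution tiles. The sketch below is a minimal NumPy illustration of the standard 1D F(2,3) minimal-filtering algorithm (using Lavin and Gray's transform matrices), which produces two outputs of a 3-tap correlation with four multiplies instead of six. It is only an illustrative example of the general technique, not the paper's N-dimensional, cache-aware AVX implementation.

import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    # Two correlation outputs from a 4-element input tile d and a
    # 3-tap filter g, using 4 elementwise multiplies instead of 6.
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile (hypothetical values)
g = np.array([0.5, 1.0, -1.0])       # filter (hypothetical values)

# Direct valid correlation for reference: y[i] = sum_k d[i+k] * g[k]
direct = np.array([d[0:3] @ g, d[1:4] @ g])
assert np.allclose(winograd_f23(d, g), direct)

Higher-dimensional variants nest these 1D transforms along each axis, which is the generalization the paper automates and optimizes for multicore CPUs.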

Cite this Paper


BibTeX
@InProceedings{pmlr-v70-budden17a,
  title = {Deep Tensor Convolution on Multicores},
  author = {David Budden and Alexander Matveev and Shibani Santurkar and Shraman Ray Chaudhuri and Nir Shavit},
  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
  pages = {615--624},
  year = {2017},
  editor = {Precup, Doina and Teh, Yee Whye},
  volume = {70},
  series = {Proceedings of Machine Learning Research},
  month = {06--11 Aug},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v70/budden17a/budden17a.pdf},
  url = {https://proceedings.mlr.press/v70/budden17a.html},
  abstract = {Deep convolutional neural networks (ConvNets) of 3-dimensional kernels allow joint modeling of spatiotemporal features. These networks have improved performance of video and volumetric image analysis, but have been limited in size due to the low memory ceiling of GPU hardware. Existing CPU implementations overcome this constraint but are impractically slow. Here we extend and optimize the faster Winograd-class of convolutional algorithms to the $N$-dimensional case and specifically for CPU hardware. First, we remove the need to manually hand-craft algorithms by exploiting the relaxed constraints and cheap sparse access of CPU memory. Second, we maximize CPU utilization and multicore scalability by transforming data matrices to be cache-aware, integer multiples of AVX vector widths. Treating 2-dimensional ConvNets as a special (and the least beneficial) case of our approach, we demonstrate a 5 to 25-fold improvement in throughput compared to previous state-of-the-art.}
}
Endnote
%0 Conference Paper
%T Deep Tensor Convolution on Multicores
%A David Budden
%A Alexander Matveev
%A Shibani Santurkar
%A Shraman Ray Chaudhuri
%A Nir Shavit
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-budden17a
%I PMLR
%P 615--624
%U https://proceedings.mlr.press/v70/budden17a.html
%V 70
%X Deep convolutional neural networks (ConvNets) of 3-dimensional kernels allow joint modeling of spatiotemporal features. These networks have improved performance of video and volumetric image analysis, but have been limited in size due to the low memory ceiling of GPU hardware. Existing CPU implementations overcome this constraint but are impractically slow. Here we extend and optimize the faster Winograd-class of convolutional algorithms to the $N$-dimensional case and specifically for CPU hardware. First, we remove the need to manually hand-craft algorithms by exploiting the relaxed constraints and cheap sparse access of CPU memory. Second, we maximize CPU utilization and multicore scalability by transforming data matrices to be cache-aware, integer multiples of AVX vector widths. Treating 2-dimensional ConvNets as a special (and the least beneficial) case of our approach, we demonstrate a 5 to 25-fold improvement in throughput compared to previous state-of-the-art.
APA
Budden, D., Matveev, A., Santurkar, S., Chaudhuri, S. R. & Shavit, N. (2017). Deep Tensor Convolution on Multicores. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:615-624. Available from https://proceedings.mlr.press/v70/budden17a.html.