Persistent RNNs: Stashing Recurrent Weights On-Chip
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:2024-2033, 2016.
Abstract
This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNNs) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a mini-batch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.
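The core idea described above, loading the recurrent weight matrix on-chip once and reusing it across every timestep of a sequence within a single kernel launch, can be illustrated with a small CUDA sketch. The code below is not the paper's implementation: it assumes a single thread block, a toy hidden size, and caches the weights in shared memory, whereas the paper's kernels partition the weights across the register files of all SMs and synchronize the whole GPU with a software barrier between timesteps. The names persistent_rnn, HIDDEN, and TIMESTEPS are illustrative.

```cuda
// Minimal sketch of a persistent RNN kernel: the recurrent weights are read
// from off-chip memory once, stashed on-chip, and reused for every timestep
// of the sequence inside a single kernel launch.
//
// Simplifications relative to the paper: one thread block, shared memory
// instead of register caching, a plain ReLU recurrence, mini-batch size 1.

#include <cuda_runtime.h>
#include <cstdio>

constexpr int HIDDEN = 64;      // hidden units (toy size so weights fit in shared memory)
constexpr int TIMESTEPS = 100;  // sequence length

// h_t = relu(W * h_{t-1} + x_t)
__global__ void persistent_rnn(const float* __restrict__ W,  // HIDDEN x HIDDEN
                               const float* __restrict__ x,  // TIMESTEPS x HIDDEN
                               float* __restrict__ h)        // HIDDEN, holds h_0 on entry
{
    // Stash the recurrent weights and the hidden state on-chip once,
    // before the time loop, so W is never re-fetched from DRAM.
    __shared__ float Ws[HIDDEN * HIDDEN];
    __shared__ float h_prev[HIDDEN];

    for (int i = threadIdx.x; i < HIDDEN * HIDDEN; i += blockDim.x)
        Ws[i] = W[i];
    for (int i = threadIdx.x; i < HIDDEN; i += blockDim.x)
        h_prev[i] = h[i];
    __syncthreads();

    // One thread owns one hidden unit (assumes blockDim.x == HIDDEN).
    const int row = threadIdx.x;

    for (int t = 0; t < TIMESTEPS; ++t) {
        float acc = x[t * HIDDEN + row];
        for (int k = 0; k < HIDDEN; ++k)
            acc += Ws[row * HIDDEN + k] * h_prev[k];
        acc = fmaxf(acc, 0.0f);   // ReLU nonlinearity

        __syncthreads();          // all threads finished reading h_{t-1}
        h_prev[row] = acc;
        __syncthreads();          // h_prev now holds h_t for every row
    }

    h[row] = h_prev[row];         // write the final hidden state back
}

int main() {
    float *W, *x, *h;
    cudaMallocManaged(&W, HIDDEN * HIDDEN * sizeof(float));
    cudaMallocManaged(&x, TIMESTEPS * HIDDEN * sizeof(float));
    cudaMallocManaged(&h, HIDDEN * sizeof(float));

    // Toy initialization: small weights, unit inputs, zero initial state.
    for (int i = 0; i < HIDDEN * HIDDEN; ++i) W[i] = 0.001f;
    for (int i = 0; i < TIMESTEPS * HIDDEN; ++i) x[i] = 1.0f;
    for (int i = 0; i < HIDDEN; ++i) h[i] = 0.0f;

    // A single kernel launch covers the entire sequence.
    persistent_rnn<<<1, HIDDEN>>>(W, x, h);
    cudaDeviceSynchronize();

    printf("h_T[0] = %f\n", h[0]);
    cudaFree(W); cudaFree(x); cudaFree(h);
    return 0;
}
```

The point of the sketch is that the whole sequence is processed in one launch, so the weight matrix crosses the off-chip memory boundary only once regardless of sequence length; this weight reuse is what makes small mini-batches efficient, since a conventional per-timestep matrix-multiply implementation would reload the weights at every step.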