Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin

Dario Amodei; Sundaram Ananthanarayanan; Rishita Anubhai; Jingliang Bai; Eric Battenberg; Carl Case; Jared Casper; Bryan Catanzaro; Qiang Cheng; Guoliang Chen; Jie Chen; Jingdong Chen; Zhijie Chen; Mike Chrzanowski; Adam Coates; Greg Diamos; Ke Ding; Niandong Du; Erich Elsen; Jesse Engel; Weiwei Fang; Linxi Fan; Christopher Fougner; Liang Gao; Caixia Gong; Awni Hannun; Tony Han; Lappi Johannes; Bing Jiang; Cai Ju; Billy Jun; Patrick LeGresley; Libby Lin; Junjie Liu; Yang Liu; Weigao Li; Xiangang Li; Dongpeng Ma; Sharan Narang; Andrew Ng; Sherjil Ozair; Yiping Peng; Ryan Prenger; Sheng Qian; Zongfeng Quan; Jonathan Raiman; Vinay Rao; Sanjeev Satheesh; David Seetapun; Shubho Sengupta; Kavya Srinet; Anuroop Sriram; Haiyuan Tang; Liliang Tang; Chong Wang; Jidong Wang; Kaifu Wang; Yi Wang; Zhijian Wang; Zhiqian Wang; Shuang Wu; Likai Wei; Bo Xiao; Wen Xie; Yan Xie; Dani Yogatama; Bin Yuan; Jun Zhan; Zhenyao Zhu

Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:173-182, 2016.

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

Cite this Paper

BibTeX


@InProceedings{pmlr-v48-amodei16,
  title = 	 {Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin},
  author = 	 {Amodei, Dario and Ananthanarayanan, Sundaram and Anubhai, Rishita and Bai, Jingliang and Battenberg, Eric and Case, Carl and Casper, Jared and Catanzaro, Bryan and Cheng, Qiang and Chen, Guoliang and Chen, Jie and Chen, Jingdong and Chen, Zhijie and Chrzanowski, Mike and Coates, Adam and Diamos, Greg and Ding, Ke and Du, Niandong and Elsen, Erich and Engel, Jesse and Fang, Weiwei and Fan, Linxi and Fougner, Christopher and Gao, Liang and Gong, Caixia and Hannun, Awni and Han, Tony and Johannes, Lappi and Jiang, Bing and Ju, Cai and Jun, Billy and LeGresley, Patrick and Lin, Libby and Liu, Junjie and Liu, Yang and Li, Weigao and Li, Xiangang and Ma, Dongpeng and Narang, Sharan and Ng, Andrew and Ozair, Sherjil and Peng, Yiping and Prenger, Ryan and Qian, Sheng and Quan, Zongfeng and Raiman, Jonathan and Rao, Vinay and Satheesh, Sanjeev and Seetapun, David and Sengupta, Shubho and Srinet, Kavya and Sriram, Anuroop and Tang, Haiyuan and Tang, Liliang and Wang, Chong and Wang, Jidong and Wang, Kaifu and Wang, Yi and Wang, Zhijian and Wang, Zhiqian and Wu, Shuang and Wei, Likai and Xiao, Bo and Xie, Wen and Xie, Yan and Yogatama, Dani and Yuan, Bin and Zhan, Jun and Zhu, Zhenyao},
  booktitle = 	 {Proceedings of The 33rd International Conference on Machine Learning},
  pages = 	 {173--182},
  year = 	 {2016},
  editor = 	 {Balcan, Maria Florina and Weinberger, Kilian Q.},
  volume = 	 {48},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {New York, New York, USA},
  month = 	 {20--22 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v48/amodei16.pdf},
  url = 	 {https://proceedings.mlr.press/v48/amodei16.html},
  abstract = 	 {We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.}
}

Endnote

%0 Conference Paper
%T Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
%A Dario Amodei
%A Sundaram Ananthanarayanan
%A Rishita Anubhai
%A Jingliang Bai
%A Eric Battenberg
%A Carl Case
%A Jared Casper
%A Bryan Catanzaro
%A Qiang Cheng
%A Guoliang Chen
%A Jie Chen
%A Jingdong Chen
%A Zhijie Chen
%A Mike Chrzanowski
%A Adam Coates
%A Greg Diamos
%A Ke Ding
%A Niandong Du
%A Erich Elsen
%A Jesse Engel
%A Weiwei Fang
%A Linxi Fan
%A Christopher Fougner
%A Liang Gao
%A Caixia Gong
%A Awni Hannun
%A Tony Han
%A Lappi Johannes
%A Bing Jiang
%A Cai Ju
%A Billy Jun
%A Patrick LeGresley
%A Libby Lin
%A Junjie Liu
%A Yang Liu
%A Weigao Li
%A Xiangang Li
%A Dongpeng Ma
%A Sharan Narang
%A Andrew Ng
%A Sherjil Ozair
%A Yiping Peng
%A Ryan Prenger
%A Sheng Qian
%A Zongfeng Quan
%A Jonathan Raiman
%A Vinay Rao
%A Sanjeev Satheesh
%A David Seetapun
%A Shubho Sengupta
%A Kavya Srinet
%A Anuroop Sriram
%A Haiyuan Tang
%A Liliang Tang
%A Chong Wang
%A Jidong Wang
%A Kaifu Wang
%A Yi Wang
%A Zhijian Wang
%A Zhiqian Wang
%A Shuang Wu
%A Likai Wei
%A Bo Xiao
%A Wen Xie
%A Yan Xie
%A Dani Yogatama
%A Bin Yuan
%A Jun Zhan
%A Zhenyao Zhu
%B Proceedings of The 33rd International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2016
%E Maria Florina Balcan
%E Kilian Q. Weinberger
%F pmlr-v48-amodei16
%I PMLR
%P 173--182
%U https://proceedings.mlr.press/v48/amodei16.html
%V 48
%X We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

RIS


TY  - CPAPER
TI  - Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
AU  - Dario Amodei
AU  - Sundaram Ananthanarayanan
AU  - Rishita Anubhai
AU  - Jingliang Bai
AU  - Eric Battenberg
AU  - Carl Case
AU  - Jared Casper
AU  - Bryan Catanzaro
AU  - Qiang Cheng
AU  - Guoliang Chen
AU  - Jie Chen
AU  - Jingdong Chen
AU  - Zhijie Chen
AU  - Mike Chrzanowski
AU  - Adam Coates
AU  - Greg Diamos
AU  - Ke Ding
AU  - Niandong Du
AU  - Erich Elsen
AU  - Jesse Engel
AU  - Weiwei Fang
AU  - Linxi Fan
AU  - Christopher Fougner
AU  - Liang Gao
AU  - Caixia Gong
AU  - Awni Hannun
AU  - Tony Han
AU  - Lappi Johannes
AU  - Bing Jiang
AU  - Cai Ju
AU  - Billy Jun
AU  - Patrick LeGresley
AU  - Libby Lin
AU  - Junjie Liu
AU  - Yang Liu
AU  - Weigao Li
AU  - Xiangang Li
AU  - Dongpeng Ma
AU  - Sharan Narang
AU  - Andrew Ng
AU  - Sherjil Ozair
AU  - Yiping Peng
AU  - Ryan Prenger
AU  - Sheng Qian
AU  - Zongfeng Quan
AU  - Jonathan Raiman
AU  - Vinay Rao
AU  - Sanjeev Satheesh
AU  - David Seetapun
AU  - Shubho Sengupta
AU  - Kavya Srinet
AU  - Anuroop Sriram
AU  - Haiyuan Tang
AU  - Liliang Tang
AU  - Chong Wang
AU  - Jidong Wang
AU  - Kaifu Wang
AU  - Yi Wang
AU  - Zhijian Wang
AU  - Zhiqian Wang
AU  - Shuang Wu
AU  - Likai Wei
AU  - Bo Xiao
AU  - Wen Xie
AU  - Yan Xie
AU  - Dani Yogatama
AU  - Bin Yuan
AU  - Jun Zhan
AU  - Zhenyao Zhu
BT  - Proceedings of The 33rd International Conference on Machine Learning
DA  - 2016/06/11
ED  - Maria Florina Balcan
ED  - Kilian Q. Weinberger
ID  - pmlr-v48-amodei16
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 48
SP  - 173
EP  - 182
L1  - http://proceedings.mlr.press/v48/amodei16.pdf
UR  - https://proceedings.mlr.press/v48/amodei16.html
AB  - We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech–two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, enabling experiments that previously took weeks to now run in days. This allows us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
ER  -

APA


Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J. & Zhu, Z. (2016). Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin. Proceedings of The 33rd International Conference on Machine Learning, in Proceedings of Machine Learning Research 48:173-182. Available from https://proceedings.mlr.press/v48/amodei16.html.
