Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, Jie Chen, Jingdong Chen, Zhijie Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Ke Ding, Niandong Du, Erich Elsen, Jesse Engel, Weiwei Fang, Linxi Fan, Christopher Fougner, Liang Gao, Caixia Gong, Awni Hannun, Tony Han, Lappi Johannes, Bing Jiang, Cai Ju, Billy Jun, Patrick LeGresley, Libby Lin, Junjie Liu, Yang Liu, Weigao Li, Xiangang Li, Dongpeng Ma, Sharan Narang, Andrew Ng, Sherjil Ozair, Yiping Peng, Ryan Prenger, Sheng Qian, Zongfeng Quan, Jonathan Raiman, Vinay Rao, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Kavya Srinet, Anuroop Sriram, Haiyuan Tang, Liliang Tang, Chong Wang, Jidong Wang, Kaifu Wang, Yi Wang, Zhijian Wang, Zhiqian Wang, Shuang Wu, Likai Wei, Bo Xiao, Wen Xie, Yan Xie, Dani Yogatama, Bin Yuan, Jun Zhan, Zhenyao Zhu
Proceedings of The 33rd International Conference on Machine Learning, PMLR 48:173-182, 2016.

Abstract

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech, including noisy environments, accents, and different languages. Key to our approach is our application of HPC techniques, which allows experiments that previously took weeks to run in days. This lets us iterate more quickly to identify superior architectures and algorithms. As a result, in several cases our system is competitive with the transcriptions of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
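
To make the end-to-end claim concrete, the sketch below shows the general shape of a Deep Speech 2-style acoustic model: a convolutional front end over spectrograms, stacked bidirectional recurrent layers, and a per-frame character softmax trained with CTC loss, so the network maps audio features directly to graphemes with no hand-engineered pipeline stages in between. This is a minimal illustrative sketch, not the authors' implementation; the use of PyTorch, the layer counts and sizes, and the names (DeepSpeech2Sketch, n_freq, n_chars, rnn_dim) are assumptions chosen for readability.

# A minimal sketch, assuming PyTorch; layer sizes, kernel sizes, and names are
# illustrative, not the values used in the paper.
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    def __init__(self, n_freq=161, n_chars=29, rnn_dim=512, rnn_layers=3):
        super().__init__()
        # Convolutional front end over (batch, 1, freq, time) spectrograms.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )
        conv_freq = (n_freq + 2 * 5 - 11) // 2 + 1  # frequency bins left after the conv
        # Stacked bidirectional recurrent layers over the time axis.
        self.rnn = nn.GRU(32 * conv_freq, rnn_dim, num_layers=rnn_layers,
                          bidirectional=True, batch_first=True)
        # Per-frame distribution over output characters plus the CTC blank symbol.
        self.fc = nn.Linear(2 * rnn_dim, n_chars)

    def forward(self, spec):                            # spec: (batch, 1, freq, time)
        x = self.conv(spec)                             # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', features)
        x, _ = self.rnn(x)
        return self.fc(x).log_softmax(dim=-1)           # (batch, time', n_chars)

# One training step with CTC loss on a fake batch (targets are integer-coded
# characters; index 0 is reserved for the CTC blank).
model = DeepSpeech2Sketch()
ctc = nn.CTCLoss(blank=0)
spec = torch.randn(4, 1, 161, 200)
log_probs = model(spec)                                 # (4, time', 29)
targets = torch.randint(1, 29, (4, 20))
input_lengths = torch.full((4,), log_probs.size(1), dtype=torch.long)
target_lengths = torch.full((4,), 20, dtype=torch.long)
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()

The same architecture serves both languages by changing only the output vocabulary: English uses a small grapheme set, while the Mandarin system emits Chinese characters directly. Batch Dispatch, mentioned above, is the deployment-side counterpart: requests from multiple users are grouped into batches before being run through the network on a GPU, trading a small amount of latency for much higher throughput.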

Cite this Paper


BibTeX
@InProceedings{pmlr-v48-amodei16,
  title     = {Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
  author    = {Dario Amodei and Sundaram Ananthanarayanan and Rishita Anubhai and Jingliang Bai and Eric Battenberg and Carl Case and Jared Casper and Bryan Catanzaro and Qiang Cheng and Guoliang Chen and Jie Chen and Jingdong Chen and Zhijie Chen and Mike Chrzanowski and Adam Coates and Greg Diamos and Ke Ding and Niandong Du and Erich Elsen and Jesse Engel and Weiwei Fang and Linxi Fan and Christopher Fougner and Liang Gao and Caixia Gong and Awni Hannun and Tony Han and Lappi Johannes and Bing Jiang and Cai Ju and Billy Jun and Patrick LeGresley and Libby Lin and Junjie Liu and Yang Liu and Weigao Li and Xiangang Li and Dongpeng Ma and Sharan Narang and Andrew Ng and Sherjil Ozair and Yiping Peng and Ryan Prenger and Sheng Qian and Zongfeng Quan and Jonathan Raiman and Vinay Rao and Sanjeev Satheesh and David Seetapun and Shubho Sengupta and Kavya Srinet and Anuroop Sriram and Haiyuan Tang and Liliang Tang and Chong Wang and Jidong Wang and Kaifu Wang and Yi Wang and Zhijian Wang and Zhiqian Wang and Shuang Wu and Likai Wei and Bo Xiao and Wen Xie and Yan Xie and Dani Yogatama and Bin Yuan and Jun Zhan and Zhenyao Zhu},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  pages     = {173--182},
  year      = {2016},
  editor    = {Maria Florina Balcan and Kilian Q. Weinberger},
  volume    = {48},
  series    = {Proceedings of Machine Learning Research},
  address   = {New York, New York, USA},
  month     = {20--22 Jun},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v48/amodei16.pdf},
  url       = {http://proceedings.mlr.press/v48/amodei16.html}
}
APA
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J. & Zhu, Z. (2016). Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Proceedings of The 33rd International Conference on Machine Learning, in PMLR 48:173-182.
