Towards End-To-End Speech Recognition with Recurrent Neural Networks

Alex Graves, Navdeep Jaitly
Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1764-1772, 2014.

Abstract

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
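The figures quoted above are word error rates. As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), the following may help; the function name `word_error_rate` and the whitespace tokenisation are illustrative assumptions, not the paper's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; real evaluation
    toolkits usually apply extra text normalisation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.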

Cite this Paper


BibTeX
@InProceedings{pmlr-v32-graves14,
  title = {Towards End-To-End Speech Recognition with Recurrent Neural Networks},
  author = {Graves, Alex and Jaitly, Navdeep},
  booktitle = {Proceedings of the 31st International Conference on Machine Learning},
  pages = {1764--1772},
  year = {2014},
  editor = {Xing, Eric P. and Jebara, Tony},
  volume = {32},
  number = {2},
  series = {Proceedings of Machine Learning Research},
  address = {Beijing, China},
  month = {22--24 Jun},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v32/graves14.pdf},
  url = {https://proceedings.mlr.press/v32/graves14.html},
  abstract = {This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.}
}
Endnote
%0 Conference Paper
%T Towards End-To-End Speech Recognition with Recurrent Neural Networks
%A Alex Graves
%A Navdeep Jaitly
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-graves14
%I PMLR
%P 1764--1772
%U https://proceedings.mlr.press/v32/graves14.html
%V 32
%N 2
%X This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
RIS
TY  - CPAPER
TI  - Towards End-To-End Speech Recognition with Recurrent Neural Networks
AU  - Alex Graves
AU  - Navdeep Jaitly
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-graves14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 1764
EP  - 1772
L1  - http://proceedings.mlr.press/v32/graves14.pdf
UR  - https://proceedings.mlr.press/v32/graves14.html
AB  - This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
ER  -
APA
Graves, A. & Jaitly, N. (2014). Towards End-To-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1764-1772. Available from https://proceedings.mlr.press/v32/graves14.html.
