Towards End-To-End Speech Recognition with Recurrent Neural Networks

Alex Graves; Navdeep Jaitly

Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1764-1772, 2014.

Abstract

This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.

Cite this Paper

BibTeX


@InProceedings{pmlr-v32-graves14,
  title = 	 {Towards End-To-End Speech Recognition with Recurrent Neural Networks},
  author = 	 {Graves, Alex and Jaitly, Navdeep},
  booktitle = 	 {Proceedings of the 31st International Conference on Machine Learning},
  pages = 	 {1764--1772},
  year = 	 {2014},
  editor = 	 {Xing, Eric P. and Jebara, Tony},
  volume = 	 {32},
  number =       {2},
  series = 	 {Proceedings of Machine Learning Research},
  address = 	 {Beijing, China},
  month = 	 {22--24 Jun},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v32/graves14.pdf},
  url = 	 {https://proceedings.mlr.press/v32/graves14.html},
  abstract = 	 {This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.}
}

Endnote

%0 Conference Paper
%T Towards End-To-End Speech Recognition with Recurrent Neural Networks
%A Alex Graves
%A Navdeep Jaitly
%B Proceedings of the 31st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-graves14
%I PMLR
%P 1764--1772
%U https://proceedings.mlr.press/v32/graves14.html
%V 32
%N 2
%X This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.

RIS


TY  - CPAPER
TI  - Towards End-To-End Speech Recognition with Recurrent Neural Networks
AU  - Alex Graves
AU  - Navdeep Jaitly
BT  - Proceedings of the 31st International Conference on Machine Learning
DA  - 2014/06/18
ED  - Eric P. Xing
ED  - Tony Jebara
ID  - pmlr-v32-graves14
PB  - PMLR
DP  - Proceedings of Machine Learning Research
VL  - 32
IS  - 2
SP  - 1764
EP  - 1772
L1  - http://proceedings.mlr.press/v32/graves14.pdf
UR  - https://proceedings.mlr.press/v32/graves14.html
AB  - This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
ER  -

APA


Graves, A. & Jaitly, N. (2014). Towards End-To-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, in Proceedings of Machine Learning Research 32(2):1764-1772. Available from https://proceedings.mlr.press/v32/graves14.html.
