Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1764-1772, 2014.
Abstract
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
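The figures quoted in the abstract are word error rates. As a rough illustration of that metric only, the sketch below computes the word-level edit distance between a reference transcript and a hypothesis and divides it by the reference length. This is the commonly used definition, not code from the paper; the function name `word_error_rate` and the whitespace tokenisation are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenises on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed
    # so far (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.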
@InProceedings{pmlr-v32-graves14,
title = {Towards End-To-End Speech Recognition with Recurrent Neural Networks},
author = {Alex Graves and Navdeep Jaitly},
booktitle = {Proceedings of the 31st International Conference on Machine Learning},
pages = {1764--1772},
year = {2014},
editor = {Eric P. Xing and Tony Jebara},
volume = {32},
number = {2},
series = {Proceedings of Machine Learning Research},
address = {Beijing, China},
month = {22--24 Jun},
publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v32/graves14.pdf},
url = {http://proceedings.mlr.press/v32/graves14.html},
abstract = {This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3\% on the Wall Street Journal corpus with no prior linguistic information, 21.9\% with only a lexicon of allowed words, and 8.2\% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7\%.}
}
%0 Conference Paper
%T Towards End-To-End Speech Recognition with Recurrent Neural Networks
%A Alex Graves
%A Navdeep Jaitly
%B Proceedings of the 31st International Conference on Machine Learning
%C Beijing, China
%D 2014
%E Eric P. Xing
%E Tony Jebara
%F pmlr-v32-graves14
%I PMLR
%J Proceedings of Machine Learning Research
%P 1764--1772
%U http://proceedings.mlr.press
%V 32
%N 2
%W PMLR
%X This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
TY - CPAPER
TI - Towards End-To-End Speech Recognition with Recurrent Neural Networks
AU - Alex Graves
AU - Navdeep Jaitly
BT - Proceedings of the 31st International Conference on Machine Learning
PY - 2014/01/27
DA - 2014/01/27
ED - Eric P. Xing
ED - Tony Jebara
ID - pmlr-v32-graves14
PB - PMLR
SP - 1764
DP - PMLR
EP - 1772
L1 - http://proceedings.mlr.press/v32/graves14.pdf
UR - http://proceedings.mlr.press/v32/graves14.html
AB - This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
ER -
Graves, A. & Jaitly, N. (2014). Towards End-To-End Speech Recognition with Recurrent Neural Networks. Proceedings of the 31st International Conference on Machine Learning, in PMLR 32(2):1764-1772.