Effective Sentence Scoring Method Using BERT for Speech Recognition


Joonbo Shin, Yoonhyung Lee, Kyomin Jung ;
Proceedings of The Eleventh Asian Conference on Machine Learning, PMLR 101:1081-1093, 2019.


In automatic speech recognition, language models (LMs) have been used in many ways to improve performance. Some of the studies have tried to use bidirectional LMs (biLMs) for rescoring the $n$-best hypothesis list decoded from the acoustic model. Despite their theoretical advantages over conventional unidirectional LMs (uniLMs), previous biLMs have not given notable improvements compared to the uniLMs in the experiments. This is due to the architectural limitation that the rightward and leftward representations are not fused in the biLMs. Recently, BERT addressed the same issue by proposing the masked language modeling and achieved state-of-the-art performances in many downstream tasks by fine-tuning the pre-trained BERT. In this paper, we propose an effective sentence scoring method by adjusting the BERT to the $n$-best list rescoring task, which has no fine-tuning step. The core idea of how we modify the BERT for the rescoring task is bridging the gap between training and testing environments by considering the only masked language modeling within a single sentence. Experimental results on the LibriSpeech corpus show that the proposed scoring method using our biLM outperforms uniLMs for the $n$-best list rescoring, consistently and significantly in all experimental conditions. Additionally, an analysis about where word errors occur in a sentence demonstrates that our biLM is more robust than the uniLM especially when a recognized sentence is short or a misrecognized word is at the beginning of the sentence. Consequently, we empirically prove that the left and right representations should be fused in biLMs for scoring a sentence.

Related Material