Multichannel End-to-end Speech Recognition

Tsubasa Ochiai; Shinji Watanabe; Takaaki Hori; John R. Hershey

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:2632-2641, 2017.

Abstract

The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.

Cite this Paper

BibTeX

@InProceedings{pmlr-v70-ochiai17a,
  title = 	 {Multichannel End-to-end Speech Recognition},
  author =       {Tsubasa Ochiai and Shinji Watanabe and Takaaki Hori and John R. Hershey},
  booktitle = 	 {Proceedings of the 34th International Conference on Machine Learning},
  pages = 	 {2632--2641},
  year = 	 {2017},
  editor = 	 {Precup, Doina and Teh, Yee Whye},
  volume = 	 {70},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {06--11 Aug},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v70/ochiai17a/ochiai17a.pdf},
  url = 	 {https://proceedings.mlr.press/v70/ochiai17a.html},
  abstract = 	 {The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.}
}

Endnote

%0 Conference Paper
%T Multichannel End-to-end Speech Recognition
%A Tsubasa Ochiai
%A Shinji Watanabe
%A Takaaki Hori
%A John R. Hershey
%B Proceedings of the 34th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2017
%E Doina Precup
%E Yee Whye Teh
%F pmlr-v70-ochiai17a
%I PMLR
%P 2632--2641
%U https://proceedings.mlr.press/v70/ochiai17a.html
%V 70
%X The field of speech recognition is in the midst of a paradigm shift: end-to-end neural networks are challenging the dominance of hidden Markov models as a core technology. Using an attention mechanism in a recurrent encoder-decoder architecture solves the dynamic time alignment problem, allowing joint end-to-end training of the acoustic and language modeling components. In this paper we extend the end-to-end framework to encompass microphone array signal processing for noise suppression and speech enhancement within the acoustic encoding network. This allows the beamforming components to be optimized jointly within the recognition architecture to improve the end-to-end speech recognition objective. Experiments on the noisy speech benchmarks (CHiME-4 and AMI) show that our multichannel end-to-end system outperformed the attention-based baseline with input from a conventional adaptive beamformer.

APA

Ochiai, T., Watanabe, S., Hori, T. & Hershey, J.R. (2017). Multichannel End-to-end Speech Recognition. Proceedings of the 34th International Conference on Machine Learning, in Proceedings of Machine Learning Research 70:2632-2641. Available from https://proceedings.mlr.press/v70/ochiai17a.html.
