Self-supervised learning with random-projection quantizer for speech recognition

Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:3915-3924, 2022.

Abstract

We present a simple and effective self-supervised learning approach for speech recognition. The approach trains a model to predict masked speech signals in the form of discrete labels generated by a random-projection quantizer. In particular, the quantizer projects speech inputs with a randomly initialized matrix and performs a nearest-neighbor lookup in a randomly initialized codebook. Neither the matrix nor the codebook is updated during self-supervised learning. Since the random-projection quantizer is not trained and is separate from the speech recognition model, the design makes the approach flexible and compatible with universal speech recognition architectures. On LibriSpeech, our approach achieves word error rates similar to those of previous self-supervised learning work with non-streaming models, and lower word error rates than previous work with streaming models. On multilingual tasks, the approach also provides significant improvements over wav2vec 2.0 and w2v-BERT.
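To make the quantizer concrete, the following is a minimal NumPy sketch of the procedure the abstract describes: project each input frame with a frozen random matrix, then take the nearest neighbor in a frozen random codebook. The dimensions, initializers, and cosine-style nearest-neighbor lookup are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

INPUT_DIM = 80        # e.g. stacked log-mel features (assumed)
CODE_DIM = 16         # projection output dimension (assumed)
CODEBOOK_SIZE = 8192  # number of discrete labels (assumed)

# Randomly initialized and then frozen: neither the projection matrix
# nor the codebook is ever updated during self-supervised learning.
projection = rng.normal(size=(INPUT_DIM, CODE_DIM))
codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))
codebook /= np.linalg.norm(codebook, axis=-1, keepdims=True)

def quantize(features: np.ndarray) -> np.ndarray:
    """Map frames of shape (T, INPUT_DIM) to discrete labels of shape (T,)
    by projecting each frame and taking its nearest codebook entry."""
    projected = features @ projection                  # (T, CODE_DIM)
    projected /= np.linalg.norm(projected, axis=-1, keepdims=True)
    # With unit-norm vectors, the nearest neighbor under cosine (or L2)
    # distance is the codeword with the largest dot product.
    return np.argmax(projected @ codebook.T, axis=-1)  # (T,)

# Usage: the labels of masked frames serve as BERT-style prediction
# targets for the speech recognition encoder.
frames = rng.normal(size=(100, INPUT_DIM))
labels = quantize(frames)
```

Because the quantizer only produces targets and shares no parameters with the encoder, any recognition architecture (streaming or non-streaming) can be trained against its labels unchanged.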

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-chiu22a,
  title     = {Self-supervised learning with random-projection quantizer for speech recognition},
  author    = {Chiu, Chung-Cheng and Qin, James and Zhang, Yu and Yu, Jiahui and Wu, Yonghui},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {3915--3924},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/chiu22a/chiu22a.pdf},
  url       = {https://proceedings.mlr.press/v162/chiu22a.html}
}
Endnote
%0 Conference Paper
%T Self-supervised learning with random-projection quantizer for speech recognition
%A Chung-Cheng Chiu
%A James Qin
%A Yu Zhang
%A Jiahui Yu
%A Yonghui Wu
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-chiu22a
%I PMLR
%P 3915--3924
%U https://proceedings.mlr.press/v162/chiu22a.html
%V 162
APA
Chiu, C., Qin, J., Zhang, Y., Yu, J. & Wu, Y. (2022). Self-supervised learning with random-projection quantizer for speech recognition. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:3915-3924. Available from https://proceedings.mlr.press/v162/chiu22a.html.