UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10937-10947, 2021.

Abstract

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both labeled and unlabeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations capture information more correlated with phonetic structures and generalize better across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pre-training and supervised transfer learning for speech recognition by up to 13.4% and 26.9% relative phone error rate reductions, respectively (averaged over all testing languages). The transferability of UniSpeech is also verified on a domain-shift speech recognition task, where it achieves a 6% relative word error rate reduction against the previous approach.
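To make the multi-task objective described above concrete, below is a minimal sketch (not the authors' implementation) of combining a supervised phonetic CTC loss on labeled speech with a wav2vec 2.0-style contrastive loss on unlabeled speech. The mixing weight alpha, the negative-sampling setup, and all tensor shapes are illustrative assumptions, not values taken from the paper.

import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    # InfoNCE-style loss: pull each context vector toward its quantized target
    # (index 0) and away from K sampled negatives.
    # Shapes: context [T, D], quantized [T, D], negatives [K, T, D].
    targets = torch.cat([quantized.unsqueeze(0), negatives], dim=0)      # [K+1, T, D]
    logits = F.cosine_similarity(context.unsqueeze(0), targets, dim=-1)  # [K+1, T]
    logits = logits / temperature
    labels = torch.zeros(logits.shape[1], dtype=torch.long)              # positive at index 0
    return F.cross_entropy(logits.transpose(0, 1), labels)

def unispeech_style_loss(log_probs, phone_targets, input_lens, target_lens,
                         context, quantized, negatives, alpha=0.5):
    # Multi-task objective: phonetic CTC loss plus contrastive self-supervised loss.
    # log_probs are log-softmax outputs of shape [T, N, C]; alpha is a hypothetical
    # mixing weight, not the paper's setting.
    ctc = F.ctc_loss(log_probs, phone_targets, input_lens, target_lens, blank=0)
    ssl = contrastive_loss(context, quantized, negatives)
    return ctc + alpha * ssl

On unlabeled batches, a natural use of such an objective is to compute only the contrastive term; the sketch omits the details of how UniSpeech ties the quantized targets to the supervised phonetic task.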

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-wang21y,
  title     = {UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data},
  author    = {Wang, Chengyi and Wu, Yu and Qian, Yao and Kumatani, Kenichi and Liu, Shujie and Wei, Furu and Zeng, Michael and Huang, Xuedong},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {10937--10947},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/wang21y/wang21y.pdf},
  url       = {https://proceedings.mlr.press/v139/wang21y.html}
}
Endnote
%0 Conference Paper
%T UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
%A Chengyi Wang
%A Yu Wu
%A Yao Qian
%A Kenichi Kumatani
%A Shujie Liu
%A Furu Wei
%A Michael Zeng
%A Xuedong Huang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-wang21y
%I PMLR
%P 10937--10947
%U https://proceedings.mlr.press/v139/wang21y.html
%V 139
APA
Wang, C., Wu, Y., Qian, Y., Kumatani, K., Liu, S., Wei, F., Zeng, M., & Huang, X. (2021). UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:10937-10947. Available from https://proceedings.mlr.press/v139/wang21y.html.
