Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation

Renjie Zheng, Junkun Chen, Mingbo Ma, Liang Huang
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:12736-12746, 2021.

Abstract

Recently, representation learning for text and speech has successfully improved many language-related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they cannot exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that by fine-tuning from FAT-MLM, our proposed speech translation models substantially improve translation quality by up to +5.9 BLEU.
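
The abstract describes FAT-MLM as a masked language model that encodes speech and text in one shared encoder and is trained to recover masked content in either modality. The sketch below illustrates that idea in PyTorch; the module names, the zero-id [MASK] convention, the frame-reconstruction loss, and all dimensions are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedMaskedLM(nn.Module):
    """Toy fused acoustic-and-text masked LM: a single Transformer encoder attends
    jointly over speech frames and text tokens and reconstructs masked content in
    both modalities (illustrative sketch, not the paper's implementation)."""
    def __init__(self, vocab_size, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.speech_proj = nn.Linear(n_mels, d_model)        # project filterbank frames
        self.text_embed = nn.Embedding(vocab_size, d_model)  # embed subword tokens (id 0 used as [MASK])
        self.modality_embed = nn.Embedding(2, d_model)       # 0 = speech, 1 = text
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)      # predict masked tokens
        self.speech_head = nn.Linear(d_model, n_mels)        # reconstruct masked frames

    def forward(self, speech, text, speech_mask, text_mask):
        # speech: (B, T_s, n_mels) filterbank features; text: (B, T_t) token ids
        # speech_mask / text_mask: boolean tensors marking positions to corrupt
        s = self.speech_proj(speech * (~speech_mask).unsqueeze(-1))  # zero out masked frames
        t = self.text_embed(text.masked_fill(text_mask, 0))          # replace masked ids with [MASK]=0
        s = s + self.modality_embed(torch.zeros(s.shape[:2], dtype=torch.long, device=s.device))
        t = t + self.modality_embed(torch.ones(t.shape[:2], dtype=torch.long, device=t.device))
        h = self.encoder(torch.cat([s, t], dim=1))                   # joint self-attention over both modalities
        h_s, h_t = h[:, :s.size(1)], h[:, s.size(1):]
        token_loss = F.cross_entropy(self.text_head(h_t)[text_mask], text[text_mask])
        frame_loss = F.l1_loss(self.speech_head(h_s)[speech_mask], speech[speech_mask])
        return token_loss + frame_loss

# Toy usage with random tensors (shapes only; real training draws on ASR, MT,
# and unpaired speech/text corpora as the abstract describes).
model = FusedMaskedLM(vocab_size=1000)
speech = torch.randn(2, 50, 80)
text = torch.randint(1, 1000, (2, 12))
loss = model(speech, text,
             speech_mask=torch.rand(2, 50) < 0.3,
             text_mask=torch.rand(2, 12) < 0.15)
loss.backward()

In the paper's framework, a speech translation model (FAT-ST) is then fine-tuned from such a pretrained fused encoder; the sketch above covers only the pretraining objective, not the translation decoder.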

Cite this Paper


BibTeX
@InProceedings{pmlr-v139-zheng21a,
  title     = {Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation},
  author    = {Zheng, Renjie and Chen, Junkun and Ma, Mingbo and Huang, Liang},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {12736--12746},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/zheng21a/zheng21a.pdf},
  url       = {https://proceedings.mlr.press/v139/zheng21a.html},
  abstract  = {Recently, representation learning for text and speech has successfully improved many language related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they can not exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that by fine-tuning from FAT-MLM, our proposed speech translation models substantially improve translation quality by up to +5.9 BLEU.}
}
Endnote
%0 Conference Paper
%T Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation
%A Renjie Zheng
%A Junkun Chen
%A Mingbo Ma
%A Liang Huang
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-zheng21a
%I PMLR
%P 12736--12746
%U https://proceedings.mlr.press/v139/zheng21a.html
%V 139
%X Recently, representation learning for text and speech has successfully improved many language related tasks. However, all existing methods suffer from two limitations: (a) they only learn from one input modality, while a unified representation for both speech and text is needed by tasks such as end-to-end speech translation, and as a result, (b) they can not exploit various large-scale text and speech data and their performance is limited by the scarcity of parallel speech translation data. To address these problems, we propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input from various types of corpora including parallel data for speech recognition and machine translation, and even pure speech and text data. Within this cross-modal representation learning framework, we further present an end-to-end model for Fused Acoustic and Text Speech Translation (FAT-ST). Experiments on three translation directions show that by fine-tuning from FAT-MLM, our proposed speech translation models substantially improve translation quality by up to +5.9 BLEU.
APA
Zheng, R., Chen, J., Ma, M. & Huang, L. (2021). Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:12736-12746. Available from https://proceedings.mlr.press/v139/zheng21a.html.
