Examining Scaling and Transfer of Language Model Architectures for Machine Translation

Biao Zhang, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat
Proceedings of the 39th International Conference on Machine Learning, PMLR 162:26176-26192, 2022.

Abstract

Natural language understanding and generation models follow one of two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases; (ii) several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality; and (iii) when paired with full-visible masking for source sequences, LMs can perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations.
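The key architectural distinction discussed in the abstract is how the source sequence is masked when source and target are concatenated into a single stream: causal masking restricts every position to earlier positions, whereas full-visible (prefix) masking lets all positions attend to the entire source. The following minimal NumPy sketch illustrates how such a prefix-LM attention mask could be constructed; the function and variable names are illustrative assumptions, not taken from the paper or its codebase.

```python
import numpy as np

def prefix_lm_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Boolean attention mask for a concatenated [source; target] sequence.

    Source positions attend to every source position (full-visible masking),
    while target positions attend to all source positions and only to earlier
    or equal target positions (causal masking).
    """
    total = src_len + tgt_len
    # Start from a fully causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Lift the causal constraint within the source block: every position may
    # see the entire source. This is what distinguishes a prefix LM from a
    # purely causal LM run over the concatenated source-target pair.
    mask[:, :src_len] = True
    return mask

# Example: 3 source tokens followed by 2 target tokens.
print(prefix_lm_mask(3, 2).astype(int))
```

In this sketch, reverting the `mask[:, :src_len] = True` line recovers the purely causal variant that the abstract reports as detrimental to translation quality.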

Cite this Paper


BibTeX
@InProceedings{pmlr-v162-zhang22h,
  title     = {Examining Scaling and Transfer of Language Model Architectures for Machine Translation},
  author    = {Zhang, Biao and Ghorbani, Behrooz and Bapna, Ankur and Cheng, Yong and Garcia, Xavier and Shen, Jonathan and Firat, Orhan},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {26176--26192},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/zhang22h/zhang22h.pdf},
  url       = {https://proceedings.mlr.press/v162/zhang22h.html}
}
Endnote
%0 Conference Paper
%T Examining Scaling and Transfer of Language Model Architectures for Machine Translation
%A Biao Zhang
%A Behrooz Ghorbani
%A Ankur Bapna
%A Yong Cheng
%A Xavier Garcia
%A Jonathan Shen
%A Orhan Firat
%B Proceedings of the 39th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2022
%E Kamalika Chaudhuri
%E Stefanie Jegelka
%E Le Song
%E Csaba Szepesvari
%E Gang Niu
%E Sivan Sabato
%F pmlr-v162-zhang22h
%I PMLR
%P 26176--26192
%U https://proceedings.mlr.press/v162/zhang22h.html
%V 162
APA
Zhang, B., Ghorbani, B., Bapna, A., Cheng, Y., Garcia, X., Shen, J., & Firat, O. (2022). Examining Scaling and Transfer of Language Model Architectures for Machine Translation. Proceedings of the 39th International Conference on Machine Learning, in Proceedings of Machine Learning Research 162:26176-26192. Available from https://proceedings.mlr.press/v162/zhang22h.html.