Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan; Lili Yu; Alexis Conneau; Wei-Ning Hsu; Karen Hambardzumyan; Susan Zhang; Stephen Roller; Naman Goyal; Omer Levy; Luke Zettlemoyer

Proceedings of the 40th International Conference on Machine Learning, PMLR 202:265-279, 2023.

Abstract

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion parameters, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also report four empirical phenomena observed during training, including emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

Cite this Paper

BibTeX


@InProceedings{pmlr-v202-aghajanyan23a,
  title = 	 {Scaling Laws for Generative Mixed-Modal Language Models},
  author =       {Aghajanyan, Armen and Yu, Lili and Conneau, Alexis and Hsu, Wei-Ning and Hambardzumyan, Karen and Zhang, Susan and Roller, Stephen and Goyal, Naman and Levy, Omer and Zettlemoyer, Luke},
  booktitle = 	 {Proceedings of the 40th International Conference on Machine Learning},
  pages = 	 {265--279},
  year = 	 {2023},
  editor = 	 {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume = 	 {202},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {23--29 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v202/aghajanyan23a/aghajanyan23a.pdf},
  url = 	 {https://proceedings.mlr.press/v202/aghajanyan23a.html},
  abstract = 	 {Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.}
}

Endnote

%0 Conference Paper
%T Scaling Laws for Generative Mixed-Modal Language Models
%A Armen Aghajanyan
%A Lili Yu
%A Alexis Conneau
%A Wei-Ning Hsu
%A Karen Hambardzumyan
%A Susan Zhang
%A Stephen Roller
%A Naman Goyal
%A Omer Levy
%A Luke Zettlemoyer
%B Proceedings of the 40th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2023
%E Andreas Krause
%E Emma Brunskill
%E Kyunghyun Cho
%E Barbara Engelhardt
%E Sivan Sabato
%E Jonathan Scarlett
%F pmlr-v202-aghajanyan23a
%I PMLR
%P 265--279
%U https://proceedings.mlr.press/v202/aghajanyan23a.html
%V 202
%X Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

APA


Aghajanyan, A., Yu, L., Conneau, A., Hsu, W., Hambardzumyan, K., Zhang, S., Roller, S., Goyal, N., Levy, O. & Zettlemoyer, L. (2023). Scaling Laws for Generative Mixed-Modal Language Models. Proceedings of the 40th International Conference on Machine Learning, in Proceedings of Machine Learning Research 202:265-279. Available from https://proceedings.mlr.press/v202/aghajanyan23a.html.
