DOGE: Domain Reweighting with Generalization Estimation

Simin Fan; Matteo Pagliardini; Martin Jaggi

Proceedings of the 41st International Conference on Machine Learning, PMLR 235:12895-12915, 2024.

Abstract

The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite their importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; and (ii) training a larger base model by sampling training domains according to the learnt domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracy across 6 tasks than baseline methods. Moreover, when the aim is to generalize to out-of-domain target tasks, which are unseen in the pretraining corpus (OOD domain), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.
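Stage (ii) of the abstract, sampling training domains according to the learnt domain weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_domains`, the domain names, and the weight values are all hypothetical and chosen only for illustration. The sketch assumes that the domain weights form a (possibly unnormalized) non-negative vector, and it does not cover stage (i), the bi-level optimization that produces the weights.

```python
import random
from collections import Counter


def sample_domains(domain_weights, num_samples, seed=0):
    """Draw domain names with probability proportional to their weights.

    domain_weights: dict mapping a domain name to a non-negative weight.
    num_samples:    number of domain draws (e.g. one per training sequence).
    seed:           seed for the local random generator, for reproducibility.
    """
    names = list(domain_weights)
    weights = [domain_weights[name] for name in names]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    rng = random.Random(seed)
    # random.choices normalizes the weights internally.
    return rng.choices(names, weights=weights, k=num_samples)


# Hypothetical domain weights; in DoGE these would come from the proxy model.
example_weights = {"domain_a": 0.5, "domain_b": 0.3, "domain_c": 0.2}

draws = sample_domains(example_weights, num_samples=10_000)
counts = Counter(draws)
empirical = {name: counts[name] / len(draws) for name in example_weights}
```

With enough draws, the entries of `empirical` should approach the normalized weights, so a domain with weight 0.5 contributes roughly half of the training sequences.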

Cite this Paper

BibTeX


@InProceedings{pmlr-v235-fan24e,
  title = 	 {{DOGE}: Domain Reweighting with Generalization Estimation},
  author =       {Fan, Simin and Pagliardini, Matteo and Jaggi, Martin},
  booktitle = 	 {Proceedings of the 41st International Conference on Machine Learning},
  pages = 	 {12895--12915},
  year = 	 {2024},
  editor = 	 {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume = 	 {235},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {21--27 Jul},
  publisher =    {PMLR},
  pdf = 	 {https://raw.githubusercontent.com/mlresearch/v235/main/assets/fan24e/fan24e.pdf},
  url = 	 {https://proceedings.mlr.press/v235/fan24e.html},
  abstract = 	 {The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite their importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; and (ii) training a larger base model by sampling training domains according to the learnt domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracy across 6 tasks than baseline methods. Moreover, when the aim is to generalize to out-of-domain target tasks, which are unseen in the pretraining corpus (OOD domain), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.}
}

Endnote

%0 Conference Paper
%T DOGE: Domain Reweighting with Generalization Estimation
%A Simin Fan
%A Matteo Pagliardini
%A Martin Jaggi
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-fan24e
%I PMLR
%P 12895--12915
%U https://proceedings.mlr.press/v235/fan24e.html
%V 235
%X The coverage and composition of the pretraining data significantly impact the generalization ability of Large Language Models (LLMs). Despite their importance, recent LLMs still rely on heuristics and trial and error to increase or reduce the influence of data domains. We propose DOmain reweighting with Generalization Estimation (DoGE), which optimizes the probability of sampling from each domain (domain weights) in a principled way. Our approach is a two-stage process consisting of (i) training a proxy model to obtain domain weights using a bi-level optimization algorithm; and (ii) training a larger base model by sampling training domains according to the learnt domain weights. In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. On the SlimPajama dataset, our base model achieves better perplexity and few-shot reasoning accuracy across 6 tasks than baseline methods. Moreover, when the aim is to generalize to out-of-domain target tasks, which are unseen in the pretraining corpus (OOD domain), DoGE effectively identifies inter-domain dependencies and consistently achieves better test perplexity on the target domain.

APA


Fan, S., Pagliardini, M. & Jaggi, M. (2024). DOGE: Domain Reweighting with Generalization Estimation. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:12895-12915. Available from https://proceedings.mlr.press/v235/fan24e.html.
