Model Performance Scaling with Multiple Data Sources

Tatsunori Hashimoto

Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4107-4116, 2021.

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d. setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2\approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.
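As a rough illustration of the log-linear scaling pattern in the i.i.d. setting that the abstract builds on, the sketch below fits a straight line to log loss against log dataset size and extrapolates to a larger dataset. It is not the paper's rational function estimator for mixed data sources. The function names (`fit_log_linear`, `predict_loss`) are hypothetical, and the data points are synthetic values made up for the example.

```python
import numpy as np


def fit_log_linear(ns, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(n)."""
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(np.log(ns), np.log(losses), deg=1)
    return slope, intercept


def predict_loss(n, slope, intercept):
    """Evaluate the fitted log-linear law at dataset size n."""
    return np.exp(intercept + slope * np.log(n))


# Synthetic, noise-free data (an assumption for illustration only):
# loss = 5 * n^(-0.5) measured at four small dataset sizes.
ns = np.array([1_000, 2_000, 4_000, 8_000], dtype=float)
losses = 5.0 * ns ** -0.5

slope, intercept = fit_log_linear(ns, losses)

# Extrapolate to a dataset eight times larger than the largest observed one.
predicted = predict_loss(64_000, slope, intercept)
print(f"slope={slope:.3f}, predicted loss at n=64000: {predicted:.5f}")
```

With real training runs the points would be noisy, so the fitted slope and the extrapolated loss should be treated as estimates rather than exact values.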

Cite this Paper

BibTeX


@InProceedings{pmlr-v139-hashimoto21a,
  title = 	 {Model Performance Scaling with Multiple Data Sources},
  author =       {Hashimoto, Tatsunori},
  booktitle = 	 {Proceedings of the 38th International Conference on Machine Learning},
  pages = 	 {4107--4116},
  year = 	 {2021},
  editor = 	 {Meila, Marina and Zhang, Tong},
  volume = 	 {139},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {18--24 Jul},
  publisher =    {PMLR},
  pdf = 	 {http://proceedings.mlr.press/v139/hashimoto21a/hashimoto21a.pdf},
  url = 	 {https://proceedings.mlr.press/v139/hashimoto21a.html},
  abstract = 	 {Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2\approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.}
}

Endnote

%0 Conference Paper
%T Model Performance Scaling with Multiple Data Sources
%A Tatsunori Hashimoto
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hashimoto21a
%I PMLR
%P 4107--4116
%U https://proceedings.mlr.press/v139/hashimoto21a.html
%V 139
%X Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2\approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.

APA


Hashimoto, T. (2021). Model Performance Scaling with Multiple Data Sources. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4107-4116. Available from https://proceedings.mlr.press/v139/hashimoto21a.html.
