Model Performance Scaling with Multiple Data Sources

Tatsunori Hashimoto
Proceedings of the 38th International Conference on Machine Learning, PMLR 139:4107-4116, 2021.

Abstract

Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d. setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2 \approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.
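The fitting procedure the abstract describes can be sketched in a few lines. The snippet below is a minimal illustrative sketch, not the paper's exact parameterization: it assumes a hypothetical rational-function form L(n1, n2) ≈ c + 1/(a1·n1 + a2·n2) for the loss as a function of per-source dataset sizes, fits it to a handful of made-up pilot-run measurements with scipy.optimize.curve_fit, and then extrapolates to a larger data mixture.

    # Illustrative sketch only: the functional form and numbers below are
    # assumptions, not the parameterization derived in the paper.
    import numpy as np
    from scipy.optimize import curve_fit

    def rational_loss(n, a1, a2, c):
        # n is a (2, k) array of per-source example counts; a1 and a2 weight
        # each source's contribution and c is the irreducible loss floor.
        n1, n2 = n
        return c + 1.0 / (a1 * n1 + a2 * n2)

    # Hypothetical pilot runs: per-source training-set sizes and measured test loss.
    sizes = np.array([[1e3, 0.0], [0.0, 1e3], [1e3, 1e3], [5e3, 1e3], [1e3, 5e3]]).T
    losses = np.array([0.92, 1.10, 0.71, 0.55, 0.63])

    # Fit the three parameters from the few observed runs.
    params, _ = curve_fit(rational_loss, sizes, losses,
                          p0=[1e-3, 1e-3, 0.3], bounds=(1e-8, np.inf))
    a1, a2, c = params

    # Predict the loss for a larger, unseen mixture of the two sources.
    print(rational_loss(np.array([[2e4], [1e4]]), a1, a2, c))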

Cite this Paper

BibTeX
@InProceedings{pmlr-v139-hashimoto21a,
  title     = {Model Performance Scaling with Multiple Data Sources},
  author    = {Hashimoto, Tatsunori},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  pages     = {4107--4116},
  year      = {2021},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  series    = {Proceedings of Machine Learning Research},
  month     = {18--24 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v139/hashimoto21a/hashimoto21a.pdf},
  url       = {https://proceedings.mlr.press/v139/hashimoto21a.html},
  abstract  = {Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2\approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.}
}
Endnote
%0 Conference Paper
%T Model Performance Scaling with Multiple Data Sources
%A Tatsunori Hashimoto
%B Proceedings of the 38th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2021
%E Marina Meila
%E Tong Zhang
%F pmlr-v139-hashimoto21a
%I PMLR
%P 4107--4116
%U https://proceedings.mlr.press/v139/hashimoto21a.html
%V 139
%X Real-world machine learning systems are often trained using a mix of data sources with varying cost and quality. Understanding how the size and composition of a training dataset affect model performance is critical for advancing our understanding of generalization, as well as designing more effective data collection policies. We show that there is a simple scaling law that predicts the loss incurred by a model even under varying dataset composition. Our work expands recent observations of scaling laws for log-linear generalization error in the i.i.d setting and uses this to cast model performance prediction as a learning problem. Using the theory of optimal experimental design, we derive a simple rational function approximation to generalization error that can be fitted using a few model training runs. Our approach can achieve highly accurate ($r^2\approx .9$) predictions of model performance under substantial extrapolation in two different standard supervised learning tasks and is accurate ($r^2 \approx .83$) on more challenging machine translation and question answering tasks where many baselines achieve worse-than-random performance.
APA
Hashimoto, T. (2021). Model Performance Scaling with Multiple Data Sources. Proceedings of the 38th International Conference on Machine Learning, in Proceedings of Machine Learning Research 139:4107-4116. Available from https://proceedings.mlr.press/v139/hashimoto21a.html.
