Scaling Laws for the Value of Individual Data Points in Machine Learning

Ian Connick Covert, Wenlong Ji, Tatsunori Hashimoto, James Zou
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:9380-9406, 2024.

Abstract

Recent works have shown that machine learning models improve at a predictable rate with the amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help determine a model's training dataset, but they take an aggregate view of the data by only considering the dataset's size. We consider a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point's contribution to a model's performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets while others are relatively more useful as part of large datasets. We provide learning-theoretic support for our scaling laws, and we observe empirically that they hold across several model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our efficient estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally, we demonstrate applications of the individualized scaling laws to data valuation and data subset selection.
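As a rough illustration of the log-linear scaling described in the abstract, the sketch below (not the authors' code) assumes a data point's expected marginal contribution at dataset size k follows psi(k) ≈ c * k**(-alpha) and recovers c and alpha from a handful of noisy observations by ordinary least squares in log-log space. The paper's maximum likelihood and amortized estimators are more sample-efficient than this fit, and all numbers here are hypothetical.

# Minimal sketch: fit an assumed per-point scaling curve psi(k) ~ c * k**(-alpha)
# to noisy marginal-contribution measurements via least squares in log-log space.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy estimates of one data point's marginal contribution,
# measured at several dataset sizes k (e.g., averaged over random subsets).
ks = np.array([100, 200, 400, 800, 1600, 3200])
true_c, true_alpha = 0.5, 0.9  # assumed ground-truth parameters for the simulation
psi = true_c * ks ** (-true_alpha) * np.exp(rng.normal(0.0, 0.1, size=ks.shape))

# Log-linear fit: log psi = log c - alpha * log k
X = np.column_stack([np.ones_like(ks, dtype=float), np.log(ks)])
coef, *_ = np.linalg.lstsq(X, np.log(psi), rcond=None)
c_hat, alpha_hat = np.exp(coef[0]), -coef[1]

print(f"estimated c = {c_hat:.3f}, estimated alpha = {alpha_hat:.3f}")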

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-covert24a,
  title     = {Scaling Laws for the Value of Individual Data Points in Machine Learning},
  author    = {Covert, Ian Connick and Ji, Wenlong and Hashimoto, Tatsunori and Zou, James},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  pages     = {9380--9406},
  year      = {2024},
  editor    = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
  volume    = {235},
  series    = {Proceedings of Machine Learning Research},
  month     = {21--27 Jul},
  publisher = {PMLR},
  pdf       = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/covert24a/covert24a.pdf},
  url       = {https://proceedings.mlr.press/v235/covert24a.html},
  abstract  = {Recent works have shown that machine learning models improve at a predictable rate with the amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help determine a model’s training dataset, but they take an aggregate view of the data by only considering the dataset’s size. We consider a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point’s contribution to model’s performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets and other points are relatively more useful as a part of large datasets. We provide learning theory support for our scaling laws and we observe empirically that it holds across several model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our efficient estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally we demonstrate applications of the individualized scaling laws to data valuation and data subset selection.}
}
Endnote
%0 Conference Paper
%T Scaling Laws for the Value of Individual Data Points in Machine Learning
%A Ian Connick Covert
%A Wenlong Ji
%A Tatsunori Hashimoto
%A James Zou
%B Proceedings of the 41st International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2024
%E Ruslan Salakhutdinov
%E Zico Kolter
%E Katherine Heller
%E Adrian Weller
%E Nuria Oliver
%E Jonathan Scarlett
%E Felix Berkenkamp
%F pmlr-v235-covert24a
%I PMLR
%P 9380--9406
%U https://proceedings.mlr.press/v235/covert24a.html
%V 235
%X Recent works have shown that machine learning models improve at a predictable rate with the amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help determine a model’s training dataset, but they take an aggregate view of the data by only considering the dataset’s size. We consider a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point’s contribution to model’s performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets and other points are relatively more useful as a part of large datasets. We provide learning theory support for our scaling laws and we observe empirically that it holds across several model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our efficient estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally we demonstrate applications of the individualized scaling laws to data valuation and data subset selection.
APA
Covert, I.C., Ji, W., Hashimoto, T. & Zou, J. (2024). Scaling Laws for the Value of Individual Data Points in Machine Learning. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:9380-9406. Available from https://proceedings.mlr.press/v235/covert24a.html.

Related Material