A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data

Ethan Harvey, Wansu Chen, David M. Kent, Michael C. Hughes
Proceedings of the 3rd Machine Learning for Health Symposium, PMLR 225:129-144, 2023.

Abstract

Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single “best-fit” curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.

Cite this Paper


BibTeX
@InProceedings{pmlr-v225-harvey23a,
  title = {A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data},
  author = {Harvey, Ethan and Chen, Wansu and Kent, David M. and Hughes, Michael C.},
  booktitle = {Proceedings of the 3rd Machine Learning for Health Symposium},
  pages = {129--144},
  year = {2023},
  editor = {Hegselmann, Stefan and Parziale, Antonio and Shanmugam, Divya and Tang, Shengpu and Asiedu, Mercy Nyamewaa and Chang, Serina and Hartvigsen, Tom and Singh, Harvineet},
  volume = {225},
  series = {Proceedings of Machine Learning Research},
  month = {10 Dec},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v225/harvey23a/harvey23a.pdf},
  url = {https://proceedings.mlr.press/v225/harvey23a.html},
  abstract = {Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single “best-fit” curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.}
}
Endnote
%0 Conference Paper
%T A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data
%A Ethan Harvey
%A Wansu Chen
%A David M. Kent
%A Michael C. Hughes
%B Proceedings of the 3rd Machine Learning for Health Symposium
%C Proceedings of Machine Learning Research
%D 2023
%E Stefan Hegselmann
%E Antonio Parziale
%E Divya Shanmugam
%E Shengpu Tang
%E Mercy Nyamewaa Asiedu
%E Serina Chang
%E Tom Hartvigsen
%E Harvineet Singh
%F pmlr-v225-harvey23a
%I PMLR
%P 129--144
%U https://proceedings.mlr.press/v225/harvey23a.html
%V 225
%X Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single “best-fit” curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.
APA
Harvey, E., Chen, W., Kent, D.M. & Hughes, M.C. (2023). A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data. Proceedings of the 3rd Machine Learning for Health Symposium, in Proceedings of Machine Learning Research 225:129-144. Available from https://proceedings.mlr.press/v225/harvey23a.html.
