Restoring balance: principled under/oversampling of data for optimal classification

Emanuele Loffredo, Mauro Pastore, Simona Cocco, Remi Monasson
Proceedings of the 41st International Conference on Machine Learning, PMLR 235:32643-32670, 2024.

Abstract

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.

Cite this Paper


BibTeX
@InProceedings{pmlr-v235-loffredo24a, title = {Restoring balance: principled under/oversampling of data for optimal classification}, author = {Loffredo, Emanuele and Pastore, Mauro and Cocco, Simona and Monasson, Remi}, booktitle = {Proceedings of the 41st International Conference on Machine Learning}, pages = {32643--32670}, year = {2024}, editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix}, volume = {235}, series = {Proceedings of Machine Learning Research}, month = {21--27 Jul}, publisher = {PMLR}, pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/loffredo24a/loffredo24a.pdf}, url = {https://proceedings.mlr.press/v235/loffredo24a.html}, abstract = {Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.} }
Endnote
%0 Conference Paper %T Restoring balance: principled under/oversampling of data for optimal classification %A Emanuele Loffredo %A Mauro Pastore %A Simona Cocco %A Remi Monasson %B Proceedings of the 41st International Conference on Machine Learning %C Proceedings of Machine Learning Research %D 2024 %E Ruslan Salakhutdinov %E Zico Kolter %E Katherine Heller %E Adrian Weller %E Nuria Oliver %E Jonathan Scarlett %E Felix Berkenkamp %F pmlr-v235-loffredo24a %I PMLR %P 32643--32670 %U https://proceedings.mlr.press/v235/loffredo24a.html %V 235 %X Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
APA
Loffredo, E., Pastore, M., Cocco, S. & Monasson, R.. (2024). Restoring balance: principled under/oversampling of data for optimal classification. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:32643-32670 Available from https://proceedings.mlr.press/v235/loffredo24a.html.

Related Material